Wednesday, June 30, 2004

The Enduring Importance of Understanding Principles

This Performance Quiz #3 by Rico Mariani, as well as the accompanying solution, ought to prove a revelation to those who imagine that in this day of 3 GHz CPUs and point-and-click coding wizards, time spent profiling code or slogging through a copy of Knuth's Art of Computer Programming is so much time wasted.

Ours is a field in which products and technologies come and go at an often dizzying speed, but the fundamentals of computer science undergo little change with the passage of time, and it's unlikely there will ever come a time when the need to understand these deeper principles will cease to exist. There's a lot more to programming than putting up sites in PHP!

The funny thing about managed code is that sometimes people forget that just because we offer an easy/convenient way to do things if you're doing a quick RAD application doesn't mean that the RAD way is in fact the right way. In fact, the good old fashioned way is very likely still available and may be an excellent choice. So that brings us to Performance Quiz #3.

The question: Characterize the difference in space and speed between RadParser and ClassicParser, consider the given test input to be typical.
The C# code for the RadParser and ClassicParser he mentions is to be found on the problem page.
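The actual quiz code lives on Rico's page, but the general shape of the contrast is worth sketching. What follows is a hypothetical, simplified illustration of the two styles - a "RAD" version that leans on Regex for convenience versus a classic single-pass character scanner - and not the quiz code itself:

    using System;
    using System.Text.RegularExpressions;

    class ParserSketch
    {
        // Convenience approach: let Regex tokenize (and allocate) everything for us.
        static int CountWordsRad(string input)
        {
            return Regex.Matches(input, @"\w+").Count;
        }

        // Classic approach: one pass over the characters, no intermediate allocations.
        static int CountWordsClassic(string input)
        {
            int count = 0;
            bool inWord = false;
            foreach (char c in input)
            {
                bool isWordChar = char.IsLetterOrDigit(c) || c == '_';
                if (isWordChar && !inWord) count++;
                inWord = isWordChar;
            }
            return count;
        }

        static void Main()
        {
            const string sample = "just because the RAD way is easy doesn't make it right";
            // Both approaches should report the same number of words.
            Console.WriteLine(CountWordsRad(sample));
            Console.WriteLine(CountWordsClassic(sample));
        }
    }

The point of the quiz is precisely to measure what the convenience costs in allocations and time once the input gets large.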

Tuesday, June 29, 2004

Thread-Safety vs. Reentrancy

Raymond Chen discusses the difference between the two concepts.

OS X Enhances Search Technologies

It seems like Apple's pulled out all the stops to try to answer Brad DeLong's complaints about the limitations imposed by the current level of support for file searching on most modern operating systems. Of course, Microsoft's big WinFS push just might have had something to do with this as well ...

Tiger introduces an innovative new system-wide metadata search engine that can instantly search through more than 100,000 files, documents, emails and contacts all at once.

Searching with Smarts
The metadata engine makes searching smarter, more flexible and powerful by indexing the descriptive informational items already saved within your files and documents. Metadata describes the “what, when and who” of every piece of information saved on your Mac: the kind of content, the author, edit history, format, size and many more details. Most documents, including Microsoft Word documents, Photoshop images and emails, already contain rich metadata items. By using this indexed information for searching, you can tap into tremendous power and accuracy for refining search results.

Faster Results
The engine automatically takes all the metadata inside files and enabled applications and puts the data into a high-performance index. This process occurs transparently and in the background, so you never experience lag times or slow downs during normal operation. When you make a change, such as adding a new file, receiving an email or entering a new contact, the metadata engine updates its index automatically. Results of search requests are displayed virtually as fast as you can type your query.

Application Metadata-Awareness
The metadata engine uses special importer technology to open and read heterogeneous file formats. Tiger includes importers for some of the most popular file formats. The metadata engine can be extended to any new file format, automatically adding an application’s information to your search results.

The search technologies in Mac OS X don’t stop with the metadata search engine. Whenever you perform a search, you will also be searching a powerful content index that uses the full contents of files to find matches. Even documents without any metadata are included in searches.
Apple now seems to have dealt with at least the free-text-search aspect of the scenario Brad outlined, and the company has also apparently taken to heart that the performance-crippling nature of the Indexing Service built into Windows 2000 and its successors is simply unacceptable if an indexing engine is to be worth using. There's no good reason why automatically keeping track of changes to documents should eat up loads of CPU time; Microsoft simply settled on a boneheaded implementation.

As nice as this new indexing engine seems to be, however, I still think that there are ways in which it won't be able to come close to relying on Google to do one's indexing for one. Barring support for semantic features such as synonymy, metonymy, antonymy, simile, metaphor, polysemy, synecdoche and the like, no machine-based system of annotation will ever touch an implicitly hand-maintained system like Google's PageRank-based offering in terms of relevance or comprehension. Do any of the indexing systems out there even have support for something as rudimentary as word-stemming? I doubt it.

UPDATE: This Register story by Andrew Orlowski is also worth reading.

OS X 10.4 Preview: Hits and Misses

Tristan Louis has some interesting things to say about the preview of Tiger offered by Apple at the WWDC. Some of the criticisms he makes, while technically accurate, are, I think, nonetheless unfair.

It's easy to fault Apple for not providing today all of the gee-whiz promises being made for Longhorn tomorrow, but if one looks at the difference between the Apple and Microsoft approaches as a matter of "release early and often" vs. "knock 'em out with big updates every few years", there's much to be said for Apple's more incremental approach. For one thing, it gets at least some of the useful new functionality into users' hands earlier; for another, it's simply much more likely to work, as few things are as common in software development as drawing up grand schemes on paper, only to see the entire edifice collapse during the implementation stage.

MSDN Product Feedback Page

MSDN now has a bug reporting system! Microsoft really does seem to be listening! Hmm, maybe Robert Scoble isn't entirely full of cowdung after all? Being as cynical as I am about Microsoft's activities and intentions, I'm finding these changes difficult to process, but if the Redmondians can keep this sort of thing coming, I just may start to change my mind about the company everyone loves to hate.

SQL Server 2005 Express Beta

I see that the SQL Server 2005 Express Beta (the successor to MSDE) is also now available for free download, as in fact are all the other products in the Visual Studio 2005 Express Beta lineup. The only interpretation I can make of all this is that Microsoft is starting to take seriously the threat posed by the allure of free development tools on Linux and OS X. Thank God for competition!

Visual C++ Express

The first Visual C++ 2005 Express Beta (equivalent to the old Visual C++ Standard Edition) is now available online for free download, and unlike the old entry-level editions, it actually includes an optimizing compiler, which is a very pleasant surprise. This sort of openness really makes a change for Microsoft.

Saturday, June 26, 2004

Google Hacks

Sun blogger Rich Burridge discusses a book published by O'Reilly called Google Hacks, which contains, as one might guess, lots of cool tips one can use to search Google more effectively, many of which are poorly documented, if documented at all.

Here's a page on which a few of the hacks can be found, including my own personal favorite - a means for getting around Google's frustrating 10-term search limit.

Help comes in the form of Google's full-word wildcard (Hack #13). It turns out that Google doesn't count wildcards toward the limit.

So when you have more than 10 words, substitute a wildcard for common words like so:

"do as * say not as * do" quote origin English usage

Presto! Google runs the search without complaint and you're in for some well-honed results.

TIP
Common words such as "I," "a," "the," and "of" actually do no good in the first place. Called "stop words," they are ignored by Google entirely. To force Google to take a stop word into account, prepend it with a + (plus) character, as in: +the.
Source code accompanying the book can be downloaded from here. Also of possible interest to those looking for more hacks is this site.

Wednesday, June 23, 2004

Using the "RunAs" Command

Microsoft's Aaron Margosis gives a very detailed rundown of how to go about administrative tasks in Windows while running with limited privileges.

Virtuoso and the Semantic Web

Kingsley Idehen talks about Tim Berners-Lee's Semantic Web vision in relation to his own Virtuoso product offering.

As I continue my quest to unravel the thinking and vision behind the "Universal Server" branding of Virtuoso, it always simplifies matters when I come across articles that bring context to this vision.

Tim Berners-Lee provided a keynote at WWW2004 earlier this week, and Paul Ford provided a keynote breakdown from which I have scrapped a poignant excerpt that helps me illuminate Virtuoso's role in the inevitable semantic web.

First off, I see the Semantic Web as a core component of Web 2.x (a minor upgrade of Web 2.0), and I see Virtuoso as a definitive Web 2.0 (and beyond) technology, hence the use today of the branding term "Universal Server". A term that I expect to become a common product moniker in the not too distant future.

The first challenge that confronts the semantic web is the creation of Semantic content. How will the content be created? Ideally, this should come from data, at the end of the day this is a data contextualization process.

A Key Phrase

In a post of his, Dare Obasanjo says something that might easily be overlooked, but which I think is of great importance if true. In particular:

However the primary scenarios the WinFS folks want to tackle are about rigidly structured data which works fine with using objects as the primary programming model. (emphasis added)
That would seem to imply that WinFS won't be doing much of anything to cure the problems arising from the ongoing information glut: how much of the data most people handle on a daily basis is "rigidly structured"? I'd say virtually none.

Monday, June 21, 2004

A Serendipitous Discovery

A fellow who goes under the name Woit led me to this article by Barry Mazur in April's Bulletin of the AMS. To show what's so fascinating about it, I'll let Woit's own words do the talking:

It brings together ideas about elliptic curves and deformations of Galois representations that were used by Wiles to prove Fermat's last theorem, mirror symmetry, quantization, non-commutative geometry and much more. I'm not convinced it all hangs together, but it's a wonderful piece of expository writing.
If that isn't fascinating, I don't know that anything is. In fact, I could scarcely have imagined that all these things had much to do with each other, much in the same way, I suppose, as it struck me as farfetched on first learning about the GUE Hypothesis.

For someone with a background in pure mathematics, and whose interests have always lain in a branch long thought the least practical of all, it's always a pleasure to discover possible new deep and meaningful links between number theory and the physical world.

The Evolution of Windows Dialog Templates

When it comes to the arcane details of Windows programming, everyone knows that Raymond Chen is the guy to look to, and he doesn't disappoint with his latest post on the 32-bit classic DIALOG template. The older article on the 16-bit version is also worth taking a look at, if only for historical perspective.

OS X Made Difficult

Microsoft's Matt Evans tells a horror story about fixing his sister-in-law's G3 machine after a system update installed a dodgy driver. Talk about an eye-opening experience!

Half of the stuff he describes won't even begin to make sense to the average person who's never dealt much with Unix at the command-line level - suffice it to say that while OS X does a great job of hiding from the user, most of the time, the fiendish complexity of a hybrid of NeXTStep and MacOS resting on top of FreeBSD, there will be occasions on which that underlying complexity must be grappled with to get things going again. In a way this shouldn't be surprising: how realistic is it to expect the power of a modern operating system without any of the accompanying complexity?

Windows NT and VMS

A December 1998 article by Mark Russinovich in which he lays out just how much the architecture of Windows NT owes to the design of VMS. Essentially, David Cutler brought his team over to Redmond and rewrote VMS from the ground up, except this time they made sure it could run Win32 programs.

When Microsoft released the first version of Windows NT in April 1993, the company's marketing and public relations campaign heavily emphasized the NT (i.e., New Technology) in the operating system's (OS's) name. Microsoft promoted NT as a cutting-edge OS that included all the features users expected in an OS for workstations and small to midsized servers. Although NT was a new OS in 1993, with a new API (i.e., Win32) and new user and systems-management tools, the roots of NT's core architecture and implementation extend back to the mid-1970s.

[............]

Most of NT's core designers had worked on and with VMS at Digital; some had worked directly with Cutler. How could these developers prevent their VMS design decisions from affecting their design and implementation of NT? Many users believe that NT's developers carried concepts from VMS to NT, but most don't know just how similar NT and VMS are at the kernel level (despite the Usenet joke that if you increment each letter in VMS you end up with WNT - Windows NT).

[............]

"Why the Fastest Chip Didn't Win" (Business Week, April 28, 1997) states that when Digital engineers noticed the similarities between VMS and NT, they brought their observations to senior management. Rather than suing, Digital cut a deal with Microsoft. In the summer of 1995, Digital announced Affinity for OpenVMS, a program that required Microsoft to help train Digital NT technicians, help promote NT and Open-VMS as two pieces of a three-tiered client/server networking solution, and promise to maintain NT support for the Alpha processor. Microsoft also paid Digital between 65 million and 100 million dollars.
Yet more evidence that there are few people alive who can beat Bill Gates for business acumen. Digital sold their birthright for a pittance, and now where are they?

A First Look at 3-D Support in Avalon

A Longhorn Developer Center article on the new graphics subsystem's support for three-dimensional drawing.

Applies to:
Longhorn Community Technical Preview, WinHEC 2004 Build (Build 4074)

Summary:
Demonstrates how Avalon's simple 3-D support enables you to create three-dimensional scenes. Discusses the Avalon ViewPort3D element and the use of point of view, camera, and lights in XAML markup.

Sunday, June 20, 2004

Three Theses of Representation in the Semantic Web

Abstract:
The Semantic Web is vitally dependent on a formal meaning for the constructs of its languages. For Semantic Web languages to work well together their formal meanings must employ a common view (or thesis) of representation, otherwise it will not be possible to reconcile documents written in different languages. The thesis of representation underlying RDF and RDFS is particularly troublesome in this regard, as it has several unusual aspects, both semantic and syntactic. A more-standard thesis of representation would result in the ability to reuse existing results and tools in the Semantic Web.

Dare Obasanjo - Jon Udell and WinFS

Obasanjo responds to Jon Udell's article, and his comments are surprisingly ambivalent for an employee of a company widely known as the Borg.

Jon Udell has started a series of blog posts about the pillars of Longhorn. So far he has written Questions about Longhorn, part 1: WinFS and Questions about Longhorn, part 2: WinFS and semantics which ask the key question "If the software industry and significant parts of Microsoft such as Office and Indigo have decided on XML as the data interchange format, why is the next generation file system for Windows basically an object oriented database instead of an XML-centric database?"

I'd be very interested in what the WinFS folks like Mike Deem would say in response to Jon if they read his blog. Personally, I worry less about how well WinFS supports XML and more about whether it will be fast, secure and failure resistant. After all, at worst WinFS will support XML as well as a regular file system does today which is good enough for me to locate and query documents with my favorite XML query language today. On the other hand, if WinFS doesn't perform well or shows the same good-idea-but-poorly-implemented nature of the Windows registry then it'll be a non-starter or much worse a widely used but often cursed aspect of Windows development (just like the Windows registry).

As Jon Udell points out the core scenarios touted for the encouraging the creation of WinFS (i.e search and adding metadata to files) don't really need a solution as complex or as intrusive to the operating system as WinFS. The only justification for something as radical and complex as WinFS is if Windows application developers end up utilizing it to meet their needs.
Obasanjo's comments are well worth taking into account; for my part, I don't see how WinFS can possibly live up to the hype Microsoft is generating about it, especially in light of the amount of time and effort that has gone into dealing with these very issues in the W3C's Semantic Web initiative.

The Semantic Web as a Language of Logic

An article by Tim Berners-Lee in which he discusses the Semantic Web from the viewpoint of mathematical logic. The key thing to take away is that there's an inherent tension between expressiveness and manageability - the more we can say, the harder it is for us to be sure that what we say makes sense, or to say it within a reasonable timeframe; think of the difference between propositional logic and first-order logic.

This looks at the Semantic Web design in the light a little reading on formal logic, of the Access Limited Logic system, in particular, and in the light of logical languages in general.
I'll have quite a bit more to say about this later when I'm less pressed for time.

Chris Sells - "WinFS is a Domain Ontology with Style"

An old post by Chris Sells in which he links to a fellow by the name of Michael Herman, telling us the following:

Michael Herman caught onto WinFS's real nature last year and I'm just now catching up.
Herman in turn refers us to a Stanford paper titled "Ontology 101: A Guide to Creating Your First Ontology". Here's what Herman has to say about the insights he's obtained from reading the Stanford primer:
This paper has (for me) brought a lot of deeper clarity to what WinFS really is: WinFS is a knowledge representation, storage and synch system ... a literal implementation of a system strongly based on ontology related concepts (whether intended by MS or not).
To this I can only say "But of course! What else could it be?" I've read my John Sowa, and it's precisely because I'm aware of the central importance of ontologies to knowledge representation that I find myself bemused whenever I hear someone say that metadata is essentially a solved problem. Nothing could be further from the truth!

Saturday, June 19, 2004

Material Related to Information Retrieval

Here's my chance to make use of Google's organizing powers to reduce the number of browser tabs I have open, by saving interesting material for later perusal:

  1. Longhorn Developer Center - WinFS
  2. The Semantic Web (for Web Developers)
  3. Jon Udell - Questions About Longhorn, Part I
  4. XML.com Article on the Semantic Web (also mentions WinFS)
  5. Slashdot Discussion on WinFS

Friday, June 18, 2004

How to write a DCOM server in C#

Adi Oltean walks us through the process of implementing a DCOM server with C#. "Why is this interesting?" you ask. Here's why:

Well, why DCOM and not .NET Remoting? For one thing, DCOM offers a secure interprocess communication channel through TCP/IP... which .NET remoting doesn't have unfortunately. Also, a DCOM server can be hosted in almost any process, including Windows Services!
And here's an outline of the actual implementation process.
The ideas are described below (this is pretty straightforward assuming you already know COM)
1) Your server process will expose a COM class factory that would just create your .NET object.
2) In COM you register the class factory using the standard CoRegisterClassObjects API
3) Make sure you call CoInitializeSecurity on your first process, for example to allow only Administrators to call in
4) Register your .NET assemblies with REGASM.EXE. Make sure your .NET class is visible through COM so CCW can be created around it (more details in MSDN on COM Interop section).
5) Remove the auto-generated InprocServer32 key after registration (REGASM puts it there but we are going out-of-proc)
6) Add the standard LocalServer32 / AppID registry keys.

That's it! Now you have a secure DCOM service implemented entirely in C# :-)
The full source code is included in the actual post.
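To give a flavor of the .NET side of this, here's a minimal, hypothetical sketch of the kind of COM-visible class REGASM would register - the interface, class and GUIDs are invented, and the class-factory, CoRegisterClassObjects and CoInitializeSecurity plumbing described in steps 1-3 lives in the hosting executable (see Adi's full source for that part):

    using System;
    using System.Runtime.InteropServices;

    // Hypothetical interface and class; the GUIDs are placeholders.
    [ComVisible(true)]
    [Guid("11111111-2222-3333-4444-555555555555")]
    [InterfaceType(ComInterfaceType.InterfaceIsIDispatch)]
    public interface IGreeter
    {
        string Greet(string name);
    }

    [ComVisible(true)]
    [Guid("66666666-7777-8888-9999-000000000000")]
    [ClassInterface(ClassInterfaceType.None)]
    public class Greeter : IGreeter
    {
        // A CCW is created around this object when COM clients call in.
        public string Greet(string name)
        {
            return "Hello from out-of-proc, " + name;
        }
    }

After building, you run REGASM on the assembly, delete the auto-generated InprocServer32 key, and add the LocalServer32/AppID keys so that COM activates the hosting process rather than loading the assembly in-proc, exactly as in steps 4-6 above.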

On the Economics of GMail

A commenter by the name of Ian Montgomerie says some very insightful things on Brad DeLong's web journal:

On another note, though, don't underestimate the cheapness of Google's storage. It's obviously very cheap relative to the competition, but in absolute terms they still can't afford to give everyone a gigabyte of email storage now. They know darn well that most people will use maybe 10% of that any time soon, and mostly in text (they would be stupid not to be compressing that text in the background down to a level where the typical GMail user's 100 megs or less of email are 10 megs or less of actual hard disc space, so a typical modern hard disc could serve 10,000 users). EVENTUALLY people will accumulate enough email, especially with attachments, so that many are actually using the bulk of their space. But that will take years by which time hard discs will be much cheaper. Google can reliably bet that from this point onward, hard disc prices will probably drop at least as fast as email volume rises.
Indeed, the speed at which storage costs per GB continue to drop is truly astounding to behold: Hitachi and Seagate both already have 400GB (i.e., 0.4 terabyte) hard drives for sale, and this is for the low-end consumer market.

Thursday, June 17, 2004

Linus vs. Tanenbaum

A piece of history: the great debate between Finnish hacker Linus Torvalds and OS guru Andrew Tanenbaum, from way back in 1992. Following is what Tanenbaum had to say about Linux:

Most older operating systems are monolithic, that is, the whole operating system is a single a.out file that runs in 'kernel mode.' This binary contains the process management, memory management, file system and the rest. Examples of such systems are UNIX, MS-DOS, VMS, MVS, OS/360, MULTICS, and many more.

The alternative is a microkernel-based system, in which most of the OS runs as separate processes, mostly outside the kernel. They communicate by message passing. The kernel's job is to handle the message passing, interrupt handling, low-level process management, and possibly the I/O. Examples of this design are the RC4000, Amoeba, Chorus, Mach, and the not-yet-released Windows/NT.

While I could go into a long story here about the relative merits of the two designs, suffice it to say that among the people who actually design operating systems, the debate is essentially over. Microkernels have won. The only real argument for monolithic systems was performance, and there is now enough evidence showing that microkernel systems can be just as fast as monolithic systems (e.g., Rick Rashid has published papers comparing Mach 3.0 to monolithic systems) that it is now all over but the shoutin'.

MINIX is a microkernel-based system. The file system and memory management are separate processes, running outside the kernel. The I/O drivers are also separate processes (in the kernel, but only because the brain-dead nature of the Intel CPUs makes that difficult to do otherwise). LINUX is a monolithic style system. This is a giant step back into the 1970s. That is like taking an existing, working C program and rewriting it in BASIC. To me, writing a monolithic system in 1991 is a truly poor idea.
And what about Linus' reply? Here's an excerpt, to give a flavor of it:
In article <12595@star.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
I was in the U.S. for a couple of weeks, so I haven't commented much on LINUX (not that I would have said much had I been around), but for what it is worth, I have a couple of comments now.

As most of you know, for me MINIX is a hobby, something that I do in the evening when I get bored writing books and there are no major wars, revolutions, or senate hearings being televised live on CNN. My real job is a professor and researcher in the area of operating systems.
You use this as an excuse for the limitations of minix? Sorry, but you loose: I've got more excuses than you have, and linux still beats the pants of minix in almost all areas. Not to mention the fact that most of the good code for PC minix seems to have been written by Bruce Evans.

Re 1: you doing minix as a hobby - look at who makes money off minix, and who gives linux out for free. Then talk about hobbies. Make minix freely available, and one of my biggest gripes with it will disappear. Linux has very much been a hobby (but a serious one: the best type) for me: I get no money for it, and it's not even part of any of my studies in the university. I've done it all on my own time, and on my own machine.

Re 2: your job is being a professor and researcher: That's one hell of a good excuse for some of the brain-damages of minix. I can only hope (and assume) that Amoeba doesn't suck like minix does.
The normally mild-mannered Finn is clearly no shrinking violet when pushed sufficiently hard.

Looking back on events, the great to-do Tanenbaum made about microkernels vs. monolithic kernels now seems utterly strange; it's now crystal clear that this was one case of an academic being so infatuated with the latest elegant ideas about design that he allowed himself to be blinded to just how far one could take less avant-garde approaches. In that respect the whole microkernel vs. monolithic kernel argument now looks very much like the RISC vs. CISC debate.

Verifying Digital Signatures on Windows 2000/XP/2003 System Files

Raymond Chen blogs about the File Signature Verification (sigverif) utility.

VSTS Newsgroups

Microsoft now has publicly accessible newsgroups for Visual Studio Team System, the update formerly known as "Whidbey". Information on how to access these newsgroups is available on this page.

Wednesday, June 16, 2004

Helping the Enemy?

Microsoft's Larry Osterman discusses a bug he found in Mozilla, and though he didn't bother to go to the trouble of filing a bug report himself, a helpful reader of his blog did, and it's now been voted up a few times. Here we see the virtue of openness - one simply can't imagine anything like this ever happening with Microsoft's own products.

Joel on Software - How Microsoft Lost the API War

A very long but amusing rant by Joel Spolsky about the workability of Microsoft's penchant for continuous revolution in the age of web applications. He also makes a vital point that I've made several times myself:

So the Web user interface is about 80% there, and even without new web browsers we can probably get 95% there. This is Good Enough for most people and it's certainly good enough for developers, who have voted to develop almost every significant new application as a web application.

Which means, suddenly, Microsoft's API doesn't matter so much. Web applications don't require Windows.

It's not that Microsoft didn't notice this was happening. Of course they did, and when the implications became clear, they slammed on the brakes. Promising new technologies like HTAs and DHTML were stopped in their tracks. The Internet Explorer team seems to have disappeared; they have been completely missing in action for several years. There's no way Microsoft is going to allow DHTML to get any better than it already is: it's just too dangerous to their core business, the rich client. The big meme at Microsoft these days is: "Microsoft is betting the company on the rich client." You'll see that somewhere in every slide presentation about Longhorn. Joe Beda, from the Avalon team, says that "Avalon, and Longhorn in general, is Microsoft's stake in the ground, saying that we believe power on your desktop, locally sitting there doing cool stuff, is here to stay. We're investing on the desktop, we think it's a good place to be, and we hope we're going to start a wave of excitement..."

The trouble is: it's too late.
Microsoft isn't opposed to standards compliance simply because it reduces lock-in: after all, Mozilla and Safari both bend over backwards to accommodate Internet Explorer's idiosyncrasies, and they are largely successful in doing so. The real reason behind Microsoft's inaction is that a web based on an easily extensible framework like XHTML 1.1 is a web in which semantically rich content can be made available to all regardless of what clients they're using. SVG would make possible fast and entirely web-based alternatives to heavyweights like Adobe Illustrator, as well as desktop-like richness in user interfaces; MathML would be the beginning of the end for old desktop standbys like Maple and Mathematica; all sorts of new extensions to XHTML could be invented to provide through the web browser the very sorts of functionality that have traditionally made Windows so compelling a platform. Bill Gates is no fool, so I'm sure he's realized all of this, and it'll be a cold day in hell before Microsoft makes any changes to Internet Explorer that bring this dangerous vision any closer to reality.

Java vs C++: Which is Faster?

Via Slashdot, it transpires that a fellow by the name of Keith Lea has updated the old "Java vs. C++ shootout" from way back in 2001, in which Java fared badly at the time; this time around, though, things are rather different.

"I was sick of hearing people say Java was slow," says Keith Lea, "so I took the benchmark code for C++ and Java from the now outdated Great Computer Language Shootout (Fall 2001) and ran the tests myself." Lea's results three years on? Java, he finds, is significantly faster than optimized C++ in many cases.
There are some caveats to the test results, however, the most important of which is Lea's choice of compiler. Why GCC 3.3.1, instead of the Intel compiler or even the freely available Visual C++ Toolkit? As I've mentioned earlier, if there's one thing GCC is not, it is a highly optimizing compiler designed to squeeze the last ounce of speed out of a given platform.

The other point of contention with Lea's tests is his choice of optimizer switches. In particular, he only opted for -O2 optimization, which is fine for most applications that spend their time waiting on data or user input, but far from optimal when high performance is important. He defends his choice by saying that -O3 performs space-speed tradeoffs, but that is hardly a compelling reason to avoid it: disk space is cheap and rapidly getting cheaper, and most people would rather put up with a little storage bloat than spend more time waiting around for code to finish executing.

Having said all the above, I do think that these updated benchmarks indicate that Java is not quite the dog it's reputed to be, at least not nowadays. As some of the Slashdot commenters point out, the JVM does have certain advantages over a C++ compiler, not least of which is that it can profile running code and optimize accordingly, something no statically compiled program can ever do. Another advantage Java has over C++ (and one it shares with Fortran*) is that it lacks pointers, which makes code analysis and optimization a lot easier than it otherwise might be. Whether these advantages will outweigh the benefits of static compilation is something that must be ascertained on a case-by-case basis.

One thing that does stand out from the testing is how much faster the server JVM is than the client version in nearly all tests, so much so that Lea says the following:
They also show that no one should ever run the client JVM when given the choice ... [and] everyone has the choice.
*One final point: no one in his right mind who knew anything about high performance computing would ever write his numerical routines in C++. There's a very good reason why Fortran compilers are still in demand!

Monday, June 14, 2004

Mathematics Lecture Notes

The University of Sydney's Bob Howlett maintains a web page with freely available lecture notes on several topics of mathematical interest, including number theory, Galois theory and the theory of group representations. While we're talking about group representations, Cambridge's Lek-Heng Lim also has a web page up for the Part IIB Mathematical Tripos, covering - yes, you guessed it - representation theory. By the way, does anyone else find it a tad strange that a page of lectures for Cambridge students is actually being hosted on a Stanford server?

Sunday, June 13, 2004

URL Rewriting with ASP.NET

An article on how to duplicate Apache-style mod_rewrite functionality on ASP.NET.
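The article has the details; as a rough idea of what the usual approach looks like under classic ASP.NET, here is a hedged sketch of an IHttpModule that rewrites a friendly URL during BeginRequest. The URL scheme and names below are invented for illustration, and the article's own implementation will differ:

    using System;
    using System.Web;

    // Illustrative only: map /articles/123 to /article.aspx?id=123.
    public class RewriteModule : IHttpModule
    {
        public void Init(HttpApplication app)
        {
            app.BeginRequest += OnBeginRequest;
        }

        private static void OnBeginRequest(object sender, EventArgs e)
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            string path = ctx.Request.Path;   // e.g. "/articles/123"
            if (path.StartsWith("/articles/", StringComparison.OrdinalIgnoreCase))
            {
                string id = path.Substring("/articles/".Length);
                // Rewrite internally; the browser still sees the friendly URL.
                ctx.RewritePath("/article.aspx?id=" + id);
            }
        }

        public void Dispose() { }
    }

The module would then be registered in web.config under the httpModules section.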

Mozilla Futures

This links to a slideshow which outlines the planned roadmap for Mozilla as it approaches version 2.0. The main thrust seems to be to provide an alternative to Microsoft's proprietary XAML GUI markup language, which is expected to be an integral part of Longhorn in 2006/2007.

Native, fast SVG support is a central part of this vision of Mozilla's future, as is hardware-assisted graphics acceleration via OpenGL and GDI+, a la Quartz Extreme and Longhorn's Desktop Compositing Engine - except that Windows 2000/XP users won't have to upgrade to obtain the benefits of hardware acceleration within the browser. Here's as good a reason as any for non-game developers to be grateful to John Carmack - he kept OpenGL on Windows alive virtually singlehandedly, by refusing to write to the DirectX API. If not for him, native OpenGL support would long since have been relegated to the high-end workstation segment of the market.

The CoolWebSearch Chronicles

A website run by a Dutch student who goes by the name Merijn Bellekom, devoted to everyone's favorite piece of spyware nastiness. He not only keeps a running tally of the numerous variants of this pest, but also provides useful tools that can (in most cases) get rid of CoolWebSearch even when old standbys like Spybot and Ad-Aware are defeated.

One interesting factoid I picked up on the page is that there are CWS variants so tenacious that not even the sorts of specialized tools he provides are able to remove them. This ought to serve as a wakeup call to all those lazy Windows users who never bother to run Windows Update and don't have the good sense to listen when urged to get themselves a real web browser. The smug assumption that third-party tools will always be able to compensate for end-user laziness and stupidity is about to die a long-overdue death.

Saturday, June 12, 2004

Blum-Blum-Shub

Wikipedia has a pretty decent entry on the Blum-Blum-Shub pseudorandom number generation algorithm. Why should anyone care about this, you ask?

The generator is not appropriate for use in simulations, only for cryptography, because it is not very fast. However, it has an unusually strong security proof, which relates the quality of the generator to the difficulty of integer factorization. When the primes are chosen appropriately, and O(log log M) bits of each x_n are output, then in the limit as M grows large, distinguishing the output bits from random will be at least as difficult as factoring M.

[............]

If integer factorization is difficult (as is suspected) then BBS with large M will have an output free from any nonrandom patterns that can be discovered with any reasonable amount of calculation. There are very few random number generators or cryptographic systems with such strong results known.
That's why.
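For the curious, the generator itself fits in a few lines. Here is a minimal C# sketch with toy-sized primes - hopelessly insecure at this scale, but it shows the recurrence and the bit extraction:

    using System;
    using System.Numerics;

    // Toy-sized Blum-Blum-Shub: p and q are both congruent to 3 mod 4, M = p*q.
    // Real use requires primes hundreds of digits long; these are for illustration only.
    class BlumBlumShub
    {
        static void Main()
        {
            BigInteger p = 11, q = 23;          // both are 3 mod 4
            BigInteger M = p * q;               // 253
            BigInteger x = 3;                   // seed, coprime to M

            for (int i = 0; i < 16; i++)
            {
                x = BigInteger.ModPow(x, 2, M); // x_{n+1} = x_n^2 mod M
                Console.Write(x & 1);           // output the least significant bit
            }
            Console.WriteLine();
        }
    }

In real use M would be the product of two large primes and only O(log log M) bits would be extracted per squaring, per the result quoted above.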

Optimization: Your Worst Enemy

For a historically-informed and much more nuanced overview of the perils and pitfalls of performance optimization, this treatise can't be beat. No matter how much you may think you know, there's always someone who knows a lot more than you do.

Why Bother Learning Assembly Language?

A very interesting Slashdot discussion is going on about the pros and cons of learning assembly language. Although I'm familiar with more than one assembly language myself, including Motorola 68000, DEC Alpha and (yuck!) Intel x86, I'm personally of the opinion that unless one is working in a field like game development or numerical computing where top performance is the utmost priority, messing about with assembly is mostly a waste of time, and performance improvements are better obtained by focusing on algorithms, I/O and program architecture. Even the most carefully hand-optimized piece of assembly code that happens to use an O(n^2) algorithm like bubble sort will eventually be outperformed by a piece of Perl or VB6 code that uses an O(n log n) alternative like merge sort.
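To put some numbers behind that claim, here's a rough C# sketch - timings will obviously vary by machine - pitting a straightforward bubble sort against the framework's built-in O(n log n) sort on the same data:

    using System;
    using System.Diagnostics;

    // Rough illustration: asymptotics beat micro-optimization.
    class BigOBeatsTweaking
    {
        static void BubbleSort(int[] a)
        {
            for (int i = 0; i < a.Length - 1; i++)
                for (int j = 0; j < a.Length - 1 - i; j++)
                    if (a[j] > a[j + 1])
                    {
                        int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                    }
        }

        static void Main()
        {
            var rnd = new Random(42);
            int n = 30000;
            int[] a = new int[n], b = new int[n];
            for (int i = 0; i < n; i++) a[i] = b[i] = rnd.Next();

            var sw = Stopwatch.StartNew();
            BubbleSort(a);                       // O(n^2)
            Console.WriteLine("bubble:  " + sw.Elapsed);

            sw = Stopwatch.StartNew();
            Array.Sort(b);                       // O(n log n) on average
            Console.WriteLine("library: " + sw.Elapsed);
        }
    }

No amount of hand-tuning the inner loop of the bubble sort will close that gap as n grows.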

For numerical computation, game development or anywhere else where access to the SIMD functionality included in most modern-day CPUs is vital, intimate knowledge of assembly language will continue to matter, as it will for those who have to work at the systems level, e.g. low-level driver writers and the like. For the vast majority of programmers outside these fields, knowledge of assembly language is only important insofar as it is difficult to do binary-level debugging without it, as is often necessary when source code is not available. Fiddling about with register-level instructions and so forth is a massive drain on developer time and energy that could be put to more profitable uses, and the worst thing of all is that in most cases, the hand-tweaked code people come up with will actually turn out to be slower than what they'd have obtained had they left the optimizing to a decent compiler.

My own two bits of advice to those concerned about performance: don't bother programming with code execution speed foremost in mind, other than to avoid gross inefficiencies where possible. Write clean, maintainable, future-oriented code, and see whether it performs acceptably before going the extra step - you'll be surprised how rarely the speed of code execution is actually an issue these days, with network I/O usually being the main holdup. If greater speed does turn out to be of the essence, the first thing to do is profile, profile, profile, and only after identifying the most critical hotspots should you bother looking for optimizations; even then, you'll likely find that you can either obtain the desired improvements by modifying the algorithm you're using or by rethinking some design decisions, or you can essentially "outsource" the problem by relying on pre-tuned third-party libraries like ATLAS or the AMD Core Math Library. I know that I relied heavily on the Compaq math libraries back when I was writing high-performance routines for the Alpha 21264.

Friday, June 11, 2004

On the Importance of Data Modeling

An interesting post on data modeling by a Sun Microsystems employee named Kristen.

If I had a nickel for everytime some Web publishing manager told me their job requires them to copy and paste content between various Web sites... Well, let's just say, I could buy a lot more stock. :)

I hear about this from friends who work on sites across the industry -- there's lots of mindless manual repurposing of content going on out there. This is a problem on sooo many levels on sooo many sites... not only is it expensive, and time consuming, but it even affects usability of web sites since all that manually repasted data is pretty much trapped in its own little silo never to be used again.

Case in point: I was in a web technology User Committee meeting the other day and the subject came up that we're using two (or three or more?) different tools to create and manage a certain type of product content, with each destination site pretty much storing its own copy of the content.
We're copying and pasting, or building perl scripts to do our data mining. What we really want to do is write it once and reuse it.

A well-respected colleague, Martin Hardee, responded "You know, it really all comes down to data models, right?"

My mind jumped in gear... Maybe these blogs are my chance to make metadata evangelists and data modeling believers out of everyone! (ok, maybe a few, if they start reading!) I know every company has this same challenge. Now thanks to Will and others, we now have a cool place to talk about it.

Thursday, June 10, 2004

Design Patterns: Abstract Factory

An important pattern to know. Not everything can be reduced to singletons ...
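For anyone who hasn't run into the pattern, here's the gist in a few lines of C#; the widget-factory example is the classic textbook one, not anything tied to the linked article:

    using System;

    // A family of related products created through a factory interface,
    // so client code never names a concrete widget class.
    interface IButton { void Paint(); }
    interface ITextBox { void Paint(); }

    interface IWidgetFactory
    {
        IButton CreateButton();
        ITextBox CreateTextBox();
    }

    class WinButton : IButton { public void Paint() { Console.WriteLine("Win32 button"); } }
    class WinTextBox : ITextBox { public void Paint() { Console.WriteLine("Win32 text box"); } }

    class WinFactory : IWidgetFactory
    {
        public IButton CreateButton() { return new WinButton(); }
        public ITextBox CreateTextBox() { return new WinTextBox(); }
    }

    class Client
    {
        static void Main()
        {
            // Swap in a MacFactory, GtkFactory, etc. without touching this code.
            IWidgetFactory factory = new WinFactory();
            factory.CreateButton().Paint();
            factory.CreateTextBox().Paint();
        }
    }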

Wednesday, June 09, 2004

An Annoying Firefox Bug

I love Firefox as a browser, so much so that I no longer use anything else, but if there's one misfeature that drives me up the wall, it's that it often renders British pages that lack explicit charset settings in their <meta> tags using the Chinese Simplified (GB18030) encoding, even when the pages in question have been sent with the right encoding information in the HTTP header. I plan to file this in Bugzilla once I've done a little more testing to ensure that there's nothing unique about my setup that's triggering this behavior.

Wow

Apple has just announced its new Power Mac G5 lineup, and the only thing I can say is that I'm impressed. Not only have the clock speeds been ramped up to a maximum of 2.5 GHz, but the entire product range now consists of dual-CPU machines. Here's one more respect in which Apple is ahead of the PC crowd, where 2-CPU machines continue to be restricted to the server and high-end workstation market.

One thing to note: though the 2.5 GHz CPU speed on the top-of-the-line G5 might not seem like a lot in the current era of 3.4 GHz Pentium 4 processors, what really matters is the amount of work a machine can execute in a given time, which comes down to IPC (instructions per clock) multiplied by clock speed. One consequence of the PowerPC's cleaner RISC design is that its IPC is higher than that of the x86 alternatives, so a single-CPU PowerPC running at 2.5 GHz is actually comparable to, if not faster than, a 3.4 GHz Pentium 4 at most tasks. It's also worth pointing out that the fastest AMD chips currently shipping run at only 2.4 GHz.

The rough parity between the fastest chips from IBM, AMD and Intel isn't surprising once one takes into account that they're all using essentially the same process technologies, and none of the three has such a stranglehold on design talent as to be able to outrace the others for very long. The difference between Intel and the other two CPU vendors lies in the conscious decision by Intel to trade off IPC for higher clock speeds, with the expectation in mind that most consumers are so taken in by raw Megahurtz [sic] that they'd be fooled into thinking the Intel offerings were superior. While this probably did do the job with the most ignorant consumers, for the most part it seems to have failed, as AMD hasn't really ceded any market share to Intel in the desktop sector.

At any rate, what's really holding Apple back at present isn't its chip architecture but the quality of its compilers. GCC is a marvelously complete, portable and standards-compliant compiler, but unlike Visual Studio .NET 2003 or Intel's optimizing compilers, it was never designed to extract every last ounce of performance from any particular CPU architecture. IBM certainly has the requisite compiler technology to squeeze big gains out of the CPUs it's delivering to Apple, so I don't see what is preventing the licensing of these tools to the broader Apple developer community, even if at an appropriately hefty price.

ADDENDUM: I spoke too soon. IBM has already released the necessary software for the OS X platform.

Tuesday, June 08, 2004

What a Surprise

Secunia's just disclosed two new Internet Explorer vulnerabilities. The suggested solution: disable Active Scripting support for all but trusted web sites - and, if I may throw in my own bit of advice, use another browser.

The Handbook of Applied Cryptography

An excellent reference on the subject, well worth its weight in gold, and available online for all to read for free! If there's one shortcoming of the book worth mentioning, it is that, having last undergone major revision in 1996, it gives no coverage of the latest developments in the field (e.g. AES or the newest ECC research). Still, mastering all the material in here will take one a pretty darned long way, and at this price there's really not much for anyone to complain about.

The Handbook would probably work best as a second look at the subject for someone who's already gone through Bruce Schneier's Applied Cryptography, which provides a far less mathematically demanding treatment of the subject matter. Although there are no proofs in the Handbook, it does presume some acquaintance with abstract algebra, number theory, probability and information theory, one that cannot really be obtained just by reading through the theorems stated as givens in Chapter 2.

Monday, June 07, 2004

The Joy of Linux

This post by Matt Mullenweg illustrates why Linux, whatever its merits as a server, engineering and HPC platform, is not about to take the desktop by storm anytime in the foreseeable future.

It’s not there yet. I’m being totally unfair, because comparing Windows or OS X to the Linux distribution I’m using (Gentoo) is like apples and oranges. Gentoo is meant for people who are comfortable with the command-line and want to experiment. (It’d be fairer to compare Windows to Suse.) But I just want to bridge a connection between an ethernet card and a wireless USB device. Is that too much to ask? When I did this in Windows I just highlighted the two connections, right-clicked, and chose “Bridge Connections.” It spun for a little bit and then it was done. End of story.

The work started yesterday, when I figured out that the reason nothing would emerge is that there were bad GCC flags in my make.conf file. How they got there, I’ll never know. Bad ebuild I guess. So I got that fixed, synced, and updated world. 85 packages! The next day I compiled a new kernel (2.6.7-rc2) but forgot to load the Tulip module required for my ethernet card. Recompile, reboot. Runs great, and I tell myself everything is running faster. Right now I’m bridging my desk LAN to the main router through the Windows desktop, and since I just moved the linux box on a new UPS I’d like to move the wireless connection there too. I was feeling lucky, so I tried just plugging it in to see what happened. dmesg, device not recognized. Search search search the excellent Gentoo forums, find out that to get my MA101 working I shouldn’t use the drivers from Sourceforge, but rather the at76c503a Atmel drivers for wireless USB devices. Download, compile against current kernel sources, install. Reboot. Don’t have any wireless tools. emerge wireless-utilities. Twiddle for 45 minutes to see why it won’t see any networks. Forgot to enable “Wireless radio (non-HAM)” support in the kernel. Recompile. Reboot. iwscan shows my network, iwconfig wlan0 works as expected. The instructions that tell me to put in ad-hoc mode are wrong. (Hour later.)
The funny thing is that this extensive excerpt doesn't even describe in full the hoops he had to jump through - and, worse yet, he never managed to successfully resolve the problems he was having.

I've been where he's at far too often in the past myself, and if there's one thing to be said about Linux, it's that its rough edges force one to learn a whole lot more about the internal workings of Unix-like systems than one might ever voluntarily have chosen to - and this is true of all Linux distributions in the long run, not just Gentoo. Does anyone really see some blue-haired grandma spending her free time leafing through Maurice Bach's Design of the UNIX Operating System or Uresh Vahalia's UNIX Internals, simply so she can get her new digital camera working?

If anyone is intent on running UNIX without going through endless hassles over arcane technical issues, I strongly suggest getting a machine from this company. Sure, it'll cost more than downloading a copy of Debian and burning it to CD-ROM, but if one's time is worth anything at all, the difference will very quickly be made up in terms of time lost for no good reason. Linux has its place - I'd recommend it over Windows Server 2003 as a DNS, web or file server any day - but that place isn't in the home of the average individual.

Sunday, June 06, 2004

A Tale of Two Sieves

An elementary introduction by Carl Pomerance to the quadratic and number-field sieve methods for integer factorization; these two techniques constitute the fastest general-purpose attacks currently known against traditional public-key cryptography, the number-field sieve being the faster of the two. These two sieves are ineffective in attacking elliptic curve cryptosystems, however, as they both rely on the index-calculus method.
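The common core of both sieves, stated loosely, is the hunt for a congruence of squares:

    x^2 \equiv y^2 \pmod{N}, \qquad x \not\equiv \pm y \pmod{N}
    \implies \gcd(x - y, N) \ \text{is a nontrivial factor of } N.

Most of the machinery in either sieve exists simply to manufacture enough such relations cheaply; Pomerance's article is an unusually readable account of how that is done.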

Making a Faster Cryptanalytic Time-Memory Trade-Off

Amazing if true. I've got to find the time to read this.

In 1980 Martin Hellman described a cryptanalytic time-memory trade-off which reduces the time of cryptanalysis by using precalculated data stored in memory. This technique was improved by Rivest before 1982 with the introduction of distinguished points which drastically reduces the number of memory lookups during cryptanalysis. This improved technique has been studied extensively but no new optimisations have been published ever since. We propose a new way of precalculating the data which reduces by two the number of calculations needed during cryptanalysis. Moreover, since the method does not make use of distinguished points, it reduces the overhead due to the variable chain length, which again significantly reduces the number of calculations. As an example we have implemented an attack on MS-Windows password hashes. Using 1.4GB of data (two CD-ROMs) we can crack 99.9% of all alphanumerical passwords hashes (2^37) in 13.6 seconds whereas it takes 101 seconds with the current approach using distinguished points. We show that the gain could be even much higher depending on the parameters used. (emphasis added)
Here's a direct link to the full paper.
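The underlying trade-off is easier to grasp with a toy sketch of the original Hellman-style construction: precompute long chains of hash-then-reduce steps, store only the endpoints, and at attack time regenerate chains to locate a captured hash. What follows is a deliberately simplified, hypothetical illustration of the precomputation step only (MD5, a made-up reduction function, tiny parameters) - not the paper's improved scheme:

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    // Toy Hellman-style time-memory trade-off: precompute chains of
    // hash -> reduce -> hash -> ... and store only (start, end) pairs.
    class TmtoSketch
    {
        const int ChainLength = 1000;
        static readonly MD5 Md5 = MD5.Create();

        static string Hash(string password)
        {
            return Convert.ToBase64String(Md5.ComputeHash(Encoding.ASCII.GetBytes(password)));
        }

        // "Reduction": map a hash back into the password space (here: 4 lowercase letters).
        static string Reduce(string hash, int step)
        {
            byte[] bytes = Encoding.ASCII.GetBytes(hash);
            var sb = new StringBuilder();
            for (int i = 0; i < 4; i++)
                sb.Append((char)('a' + (bytes[(i + step) % bytes.Length] % 26)));
            return sb.ToString();
        }

        static void Main()
        {
            // Precomputation: walk chains from arbitrary starting passwords, keeping
            // only the endpoints. Lookup later regenerates a chain from a captured
            // hash to discover which stored chain (and thus which password) it lies on.
            var table = new Dictionary<string, string>(); // end -> start
            var rnd = new Random(1);
            for (int c = 0; c < 200; c++)
            {
                string start = Reduce(rnd.Next().ToString(), c);
                string p = start;
                for (int step = 0; step < ChainLength; step++)
                    p = Reduce(Hash(p), step);
                table[p] = start;
            }
            Console.WriteLine("stored " + table.Count + " chain endpoints");
        }
    }

The paper's contribution is a smarter way of building and walking these chains so that far fewer hash evaluations are needed at attack time.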

A Cryptographic Compendium

Essentially a free online textbook on cryptographic techniques, written by John Savard. It is very comprehensive in its historical scope, though it by no means provides a rigorous introduction to the subject. Cryptography is a subject that can't really be well understood without at least a bit of background knowledge about groups, rings and fields, and ideally some exposure to elementary algebraic number theory.

Saturday, June 05, 2004

S-Box Design: A Literature Survey

Now here's a resource to bring delight to crypto junkies everywhere!

Many block ciphers are based on the old Shannon idea of the sequential application of confusion and diffusion. Typically, confusion is provided by some form of substitution ("S-boxes"). So the obvious question is whether some substitutions are better than others. The obvious answer is "Yes," because one possible substitution maps every value onto itself, just as though there were no substitution at all.

So the hunt was on for measures which would distinguish between "bad" and "good" substitutions, and for techniques to construct "good" substitutions. But since weakness measures are related to attacks, new attacks often imply a need for new measures. And since we cannot know what attack an Opponent may use, how can we select a substitution which will defeat that attack?

Accordingly, this reviewer has a bias for randomly-selected and keyed S-boxes. While these cannot be expected to have optimum strength against whatever is being measured, they can be expected to have average strength against even unknown attacks: Where there is no systematic design, there can be no systematic weakness. And when S-boxes are chosen at random, everyone can be sure that no S-box "trap door" is present. Keying the S-boxes inevitably takes time, but some authors count this as an advantage in slowing attacks.

Debugging Tools for Windows Updated

Microsoft has just released a new version of its Debugging Tools for Windows; information about what's new in the latest release can be found here.

Making An Operating System Faster

This article outlines 10 changes Apple made to OS X to make it "faster", i.e. more responsive - a very different thing from increasing throughput, which is also a legitimate way to define "faster." I'll concede that responsiveness is far more important to most desktop users, though, and most of the time it is even for me, unless I have, say, a big build compiling or a demanding, long-running numerical computation executing.

As one might expect, most of the optimizations carried out by Apple revolve around optimizing for the memory hierarchy*: in plain English, making judicious use of caching wherever possible, and being as clever as possible about what to cache. As the article concedes, none of the techniques employed are unique to Apple, as Windows 2000/XP/2003 and Linux also draw on the same methods, but what makes the difference between one platform and another is the quality of the implementation.

One shortcoming of Microsoft's offerings that I can point out straight away is that neither FAT32 nor NTFS pays any attention to file fragmentation without manual user intervention, and expecting the ordinary uninformed user to routinely run the slow and ineffective Disk Defragmenter that comes with Windows 2000 and XP is asking too much. The end result of this state of affairs is that over the course of a few months, most Windows users experience a dramatic slowdown in the performance of their machines even when all else is fine, and the typical, completely unnecessary response (learned from the bad old days of Windows 95/98/ME) is to reformat and reinstall Windows from scratch.

Another point on which Microsoft compares unfavorably with Apple is in the support provided by its development tools for application profiling. It's bad enough that the cheapest version of Visual Studio .NET that's worthwhile for real-world use is priced out of the reach of many developers, but what is worse still is that it comes without a profiler by default! Using Perfmon for profiling, as suggested in the .NET Framework SDK, is not a serious suggestion as far as I'm concerned.

*For more info on which, see Hennessy and Patterson, the reference textbook on Computer Architecture.

Friday, June 04, 2004

Bye-bye Evil Empire? Not Quite Yet.

I had the following to say in response to a post by Ryan Chapman, a developer in Microsoft's .NET Compact Framework group. I'm reposting this comment here because it outlines a suspicion that just occurred to me about Microsoft's negligence with respect to standards compliance in Internet Explorer - that it is a deliberate strategy to forestall a Semantic Web based on open standards, in favor of a Microsoft-centric alternative focused on Longhorn. All the features announced for Longhorn fit in with just such a strategy - for instance, who needs XHTML, RDF, OWL and the rest when you've got WinFS?

I've certainly noticed a dramatic increase in openness from your company, but I'll only really believe that Microsoft has taken a turn for the better in its business practices when you actually start to do something in response to all the web developers who've been begging you for ages to update standards-compliance in Internet Explorer.

Shoddy support for PNG, XHTML and CSS2 raises development costs in time and money, and what is most annoying is that a bunch of amateurs can get right in Firefox what Microsoft has been making excuses for refusing to deal with. The latest tack about the W3C not supplying test cases also won't wash, as the standards in question were written up with Microsoft's full participation in the process.

I'm excited about the new possibilities that Longhorn will make possible, but if I were forced to choose I would gladly forego all of them for the sake of native CSS2.1 and XHTML 1.1 support in every Windows user's default browser - *that* is how much support for web standards means to me ...

On second thought, perhaps it's that very yearning that has Microsoft worried? Why enable a Semantic Web based on open standards when one can pitch a proprietary alternative as a Windows upgrade? I wish I could dismiss this as conspiracy-mongering, but the history of your firm - astroturfing, backroom deals to "knife the baby", secret plots to grow the "polluted Java market", even supporting shill organizations like the Alexis de Tocqueville Institution which issue transparently bogus anti-Linux "studies" - makes it impossible for me to do so.

Galton's Fallacy

Here's a post to file in the "must carefully re-read later" category.

Imagine you want to conduct a study on the effectiveness of punishment and reward on flight training. You praise trainees for good landings and reprimand them for near crashes, and observe whether this has any effects.

After spending some time on the airfield your data indicate trainees did worse on their next flight if they had just been praised, and fly better just after being yelled at.

Then you have a conversation with the flight instructor who does the written examinations. He tells you that experience shows students who do well on midterms slack off on their final exam, while students who do badly generally get better.

What does this mean? Do students’ test scores tend to revert to some mean?

If you answered yes, you’ve just committed the famous “Galton’s fallacy”.

What did the 19th century statistician and geneticist Sir Francis Galton, Charles Darwin’s cousin, do wrong?

Well, he plotted the height of fathers against the height of their sons and discovered sons of tall fathers tended to be tall, but on average not as tall as their fathers. Similarly, sons of short fathers tended to be short, but on average not as short as their fathers. He—and this was his mistake—was immediately concerned that the sons of tall fathers are regressing into a pool of mediocrity along with the sons of everybody else.

So what’s wrong with that? Well, height tends to be normally distributed—i.e., the distribution takes a bell-shaped form—and sons’ heights are, due to heredity, correlated with the father’s height. So there is a linear dependence between fathers’ and the sons’ heights.

The link to the paper by Danny Quah is also well worth archiving.

Thursday, June 03, 2004

How To Write Unmaintainable Code

Or, as the Real Programmers™ would put it, how to guarantee job security ...

Abstracting Away the Data Access Layer in WordPress

One thing about WordPress that's been bothering me for some time now is the extent to which it is tied to MySQL. Whatever advantages such a close tie might have had in the past in terms of ease of development and performance optimization, going forward it strikes me as less than wise to insist that everyone install a copy of MySQL even if they already have something else in place, a point on which this discussion thread seems to indicate that many others are in substantial agreement.

The ideal scenario as I see it is that the WordPress controller component should have no knowledge whatsoever about the brand of RDBMS from and to which it is making requests for data, inserts, updates and the like; in fact, it shouldn't even care whether the backing store is an RDBMS or merely a plain text file, and adding support for a new data storage mechanism should be as simple as someone writing code that implements the interface required by the controller.

All of this assumes, of course, that WordPress is implemented as an MVC application, but I see no good reason why it shouldn't be - and it may well already be one for the most part; from what I've been able to make out of the code I've looked through thus far, the WP programmers certainly seem to have done a decent job of separating the view aspect of the application from the rest of it.
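To make the idea concrete, here's a rough sketch of the kind of seam I have in mind. WordPress is PHP, of course; the C# below is purely illustrative, with hypothetical names throughout:

    using System.Collections.Generic;

    // The controller talks only to this interface; each backing store
    // (MySQL, PostgreSQL, flat files, ...) supplies its own implementation.
    public interface IPostStore
    {
        IEnumerable<string> GetRecentPostTitles(int count);
        void SavePost(string title, string body);
    }

    public class MySqlPostStore : IPostStore
    {
        public IEnumerable<string> GetRecentPostTitles(int count)
        {
            // issue SQL against MySQL here
            yield break;
        }
        public void SavePost(string title, string body) { /* INSERT ... */ }
    }

    public class FlatFilePostStore : IPostStore
    {
        public IEnumerable<string> GetRecentPostTitles(int count)
        {
            // read from plain text files here
            yield break;
        }
        public void SavePost(string title, string body) { /* append to a file */ }
    }

Adding support for a new storage mechanism then means writing one new implementation of the interface, not touching the controller at all.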

When Firewalls Fail

I'm always amused whenever I listen to some PHB type pooh-pooh the importance of continuous security vigilance because "we have a Brand XYZ firewall that's never been hacked", and the following story illustrates why:

Two weeks ago, I taught a Guerilla .NET course for DevelopMentor in Boston. Two or three days ago, a student who listened to me rant about SQL Injection attacks during the Code Access Security module lecture sent us (myself and the other two instructors) the following. It's obviously been edited to protect the guilty:

"Hi, Ted. I want to thank you for the short primer on SQL injection attacks at the Guerrilla course in Woburn this month. We have a vendor who supplies us with electronic billing and payment services. (We send them billing data, and they present the bills to our customers and take the payments for us.) The week after the Guerrilla class I began to lose confidence in their application for various reasons, like seeing errors that included partial SQL statements, and in one case, a complete SQL statement that was accidentally left on a page from a debugging session. I told our company's business manager that I was 80% confident that I could hack into their site using SQL injection. He called the vendor, who swore up and down that after spending $83,000 on firewalls that no one could ever hack into their site, and that we should go ahead and try.

"After three hours and a Google search on SQL injection, I was running successful queries from their login page and I had their server emailing us the query results via xp_sendmail. I was also able to confirm that the SQL Server login they use for their application has sa rights. I got a list of their clients, and was able to create tables in their databases.

"The vendor promised that by the next morning everything would be fixed. So the next morning at 8:00 am I tried again. I was no longer able to get results via xp_sendmail, but I was able to shutdown their SQL Server service by sending a shutdown command. I followed that up with a friendly call to their tech support line to let them know that they needed to restart SQL Server--I didn't want to be too malicious. The guy at the other end of the line apparently had been there the entire night changing code and rolling out pages. He threatened to get on a plane, come to my office, and beat me up."
The IT world is chock full of complacent dumbasses. As long as a single port is open to the outside world, one has to be continuously on the lookout for security problems - even if that port is port 80.
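For what it's worth, the fix the vendor needed is neither exotic nor expensive: stop splicing user input into SQL text, use parameterized queries, and don't run the application as sa. A minimal C# sketch - the table and column names here are made up for illustration:

    using System.Data.SqlClient;

    // Hypothetical login check - the schema is invented; the point is the parameters.
    static class LoginCheck
    {
        public static bool IsValidLogin(string connectionString, string user, string pwd)
        {
            const string query =
                "SELECT COUNT(*) FROM Users WHERE UserName = @user AND Password = @pwd";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(query, conn))
            {
                // Parameters travel as data, never spliced into the SQL text, so
                // input like "' OR 1=1 --" can't change the shape of the query.
                cmd.Parameters.AddWithValue("@user", user);
                cmd.Parameters.AddWithValue("@pwd", pwd); // real code would compare a salted hash
                conn.Open();
                return (int)cmd.ExecuteScalar() > 0;
            }
        }
    }

Run that under a low-privilege database login instead of sa and the whole class of attack described above, xp_sendmail and shutdown commands included, simply goes away.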