Announcing Treescan version 0.6 ``A Faster Find'' Jamie Lokier, 6th March 2000 Description ----------- Treescan finds all the files in a directory tree and prints their names, using an optimised disk access strategy. It is similar to `find -print'. The added feature is that Treescan optimises the I/O in various ways. It is sometimes much faster than the naive strategy used by `find'. There is an option to prime the system cache with the inodes of all the files found. This often makes a subsequent program (such as `squid', `find', `du' etc.) run faster. The latest source is at: http://www.tantalophile.demon.co.uk/treescan/treescan-0.6.tar.gz News ---- Changes in version 0.6: - Many bugs fixed. - `-stat' option actually works. It was broken in version 0.5. - ldiff regex quoting made more robust. - New command line option `-d_type' to show file type info for debugging. - A Linux kernel patch that gives us directory entry type information. - Much faster using Linux directory entry type support. - Fixed detection of empty directories. - Faster on DJGPP (not tested). - Document describing some of the optimisations. Portability ----------- Treescan uses GNU Autoconf for configuration, and should be portable to many unix-like operating systems. It should work even with old operating systems. I've tried to support anything that GNU Find would run on. If you find a system which Treescan doesn't compile on, but GNU Find does, please report that as a bug. Uses ---- + Treescan is handy when you want a list of all the files in a directory tree. It is sometimes much faster than `find'. + `treescan /var/spool/squid -stat' is a very good thing to have in your Squid startup script, just before starting Squid. Total startup time for me is 6.7 seconds instead of 2 minutes, the disk does not thrash noisily, and the system doesn't chug for 2 minutes after booting. + Treescan can be used in your `updatedb' script if you use one. + Treescan is good to run just before `gendiff' on Linux source trees. It makes `gendiff' run a lot faster. Similar gains occur with `find' and `du' and so forth. Caveats ------- - File names are not output in the same order as `find' nor any other sensible order. To get `find'-compatible behaviour, pipe the output through `sort'. This will be addressed in future. Efficiency ---------- On the computers I have (both running Linux), Treescan 0.6 is always faster than GNU Find 4.1. Sometimes the difference is dramatic, sometimes it's not but Treescan is always faster. Efficiency on different filesystems ----------------------------------- Treescan's strategy is optimised for any filesystem that fits these assumptions: Inode number = approximate position of inode on disk (or it's stored in the directory entry), and for directories, inode number = approximate position of directory on disk. Inodes may be stored separately from data, but it isn't required. The following filesystems suit the inode assumption. Ext2 and NFS are verified by experiment; the rest from reading Linux filesystem implementations. The directory assumption is not so reliable, but assuming it doesn't seem to make anything worse: - ext2fs (standard fs on Linux and Hurd) - Some implementations of FAT, VFAT and UMSDOS. - BFS (UnixWare). - AFFS (Amiga). - EFS (SGI CD-ROM filesystem). - HPFS (OS/2, I think). - Minix. - Romfs (Linux). - QNX4. - SysV (System V, Coherent and Xenix). - UFS (BSD and derivatives such as SunOS and NeXTstep). - NFS: Linux user-space nfsd/unfsd serving any of the above - NFS: Linux kernel-space knfsd serving any of the above - NFS: Probably every other Unix implementation of NFS, serving any of the above. Theoretically some filesystems may perform badly with Treescan, because the inode numbers are really just hash values. Note: the user-space NFS server `unfsd' works well with treescan -- the served inode numbers retain the original ordering properties well enough. The inode-ordering strategy can be disabled using the `-nosort' flag. In this case Treescan can still be faster than `find' because of other things it does. (The other things Treescan does are: fewer system calls, avoid directory name lookups when possible, scan directory inodes in clusters, and avoid calling stat() on non-directories or even reading the inodes from disk if the operating system supports the right features. (Linux doesn't really support d_type yet but I think BSD does). See the ALGORITHM file for details). There are older systems on which GNU Find may run faster, for the moment: specifically those without the fchdir() system call. We don't follow GNU Find's strategy on those systems. The problem will be addressed in future. The goal is to provide the same or better level of performance compared with `find' on all platforms. Efficiency on Linux ------------------- Some kernel patches are included that really speed up scanning. Apply them, and also run `tune2fs -O filetype' on all your ext2 filesystems using e2fsprogs-1.17 or later. (Don't forget to unmount the filesystems, run tune2fs, then e2fsck -- see the tune2fs man page). However, Treescan is faster than `find' on Linux even without the patches and filesystem enhancements. Why I wrote Treescan -------------------- I wrote Treescan to make the web cache Squid start faster. Every time I booted my computer, Squid would start, and spend the next 2 minutes thrashing the disk as hard as possible. It was accessing the disk in a more or less random fashion. Linux would crawl along for those 2 minutes, and my poor disk would make the noise that suffering disks make for that time. I prefer Linux to be lean and fast from the start :-) I prefer my disk not to be worn out by unnecessary heavy activity :-) Treescan solves this problem very well: 6 seconds instead of 2 minutes. Another goal is to make `updatedb' faster. That is the only other program which thrashes my disk heavily, and it does it every night. Treescan is now always faster than `find' for this in my tests. Now I've expanded the goal: to be as efficient as a good implementation of `find' in all reasonable cases. And then to package it into a library, because lots of programs want to list directories or get the attributes of many files efficiently. Work in progress ---------------- The scanning strategy still needs a bit of work for the large disk cases. Specifically, what is best to do when the buffer cache and dentry cache don't hold everything (Linux terminology)? Thoughts on this are particularly appreciated. I detect this by assuming "fd open == inode guaranteed in memory, fd closed == not guaranteed, always enough buffer cache to hold directory contents for search". In other words, things start going pear shaped when I run out of open directory file descriptors and have to close some. I consider that the closed inodes may get flushed from the inode cache at any time after that. The problem of which inodes to close when under pressure is a rather interesting logic optimisation problem, because there are interdependencies and unknowns. More things are in the TODO list. License ------- Treescan is copyrighted by me, and currently licensed under the GNU General Public License (GPL). If it looks like becoming a useful library, I will probably change it to the Lesser GPL (LGPL). Final notes ----------- It's not obvious what is the best strategy for handling the cases when I've run out of file descriptors. Also I want to stimulate some interest in this lest it turn into rusty old code while I'm busy with other things. So I decided to release this version to get some feedback. Please send me some! If this program helps you to spot bottlenecks in your favourite free kernel, so much the better! Enjoy, -- Jamie Lokier Jamie Lokier