Saturday, October 21, 2006

Scheme Search gets pattern matching

This week I had time to work on the search engine again.

From a user's viewpoint the most important change is the addition of pattern matching. Until now it was possible to find all documents where a particular identifier occurs, but if you were unsure which identifier to search for, you were out of luck. Now you can search for occurrences of identifiers matching a regular expression.

Say you vaguely remember someone using a define-something-struct. A pattern match search for define alone gives 6993 hits, but a search for define-.*-struct gives only 22 hits (the first of which contains define-serializable-struct).

The implementation of the pattern matching search is kept very simple. There are only 150,000 search terms, so the naïve approach of simply matching all terms against the pattern one at a time is fast enough - at least for the moment. After the matching terms have been found, they are looked up in the index, and finally the list of documents is ranked.
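
In outline the naive matching looks something like the sketch below. Here lexicon-terms is a hypothetical helper that lists the terms of the lexicon; the real code works directly on the lexicon representation.

(require (lib "list.ss")) ; for filter

; matching-terms : string lexicon -> (list-of string)
;   return the terms of the lexicon matching the pattern
;   (lexicon-terms is a hypothetical helper)
(define (matching-terms pattern lexicon)
  (let ([rx (regexp pattern)])
    (filter (lambda (term) (regexp-match rx term))
            (lexicon-terms lexicon))))

; > (matching-terms "define-.*-struct" the-lexicon)
; ("define-serializable-struct" ...)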

Under the hood the representation of the lexicon changed from an in-memory hash-table to a disk-based representation. This has two advantages: the web-server uses less memory, and the lexicon uses less space on disk (even though the old lexicon resided in memory, it was still read in from disk when the web-server started). The disadvantage is that searches now require disk access. To keep disk access to a minimum, the lexicon is read in blocks of 100 terms, and a few recently used blocks are cached. If several terms need to be looked up, it is now best to look them up in alphabetical order.
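
A rough sketch of the caching idea, under assumed representations: read-block and block-of-term are hypothetical helpers, where read-block reads a block of terms from disk as an association list from terms to index locations, and block-of-term finds the number of the block a term would belong to.

(define cache '())      ; list of (block-number . block), most recently read first
(define cache-size 4)   ; keep only a few recently read blocks

; read-block and block-of-term are hypothetical helpers
(define (lookup-block block-number)
  (cond
    [(assv block-number cache) => cdr]
    [else
     (let ([block (read-block block-number)])
       (set! cache (cons (cons block-number block) cache))
       (when (> (length cache) cache-size)
         (set! cache (reverse (cdr (reverse cache)))))  ; evict the oldest block
       block)]))

; lookup-term : string -> index-location or #f
(define (lookup-term term)
  (cond
    [(assoc term (lookup-block (block-of-term term))) => cdr]
    [else #f]))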

The plan is to use the disk space saved for more indexes. Perhaps an index over the documentation a la the HelpDesk? Unfortunately I haven't got enough disk space to implement "preview" of search results.

In other news, Google has released their Google Code Search. It indexes source from SourceForge, Google Code and other public repositories.

To search for Scheme source add
file:(.scm|.ss|.sch)
to the beginning of your query.

Try the above and search for "srfi" to see how many Scheme projects they have found.


Monday, September 25, 2006

PLT Scheme Source Search Goes Live

An alpha version of PLT Scheme Source Search is now live. There are no bells and whistles, but the basic functionality is there. The source of all PLaneT packages and all built-in collections of PLT Scheme is indexed. The results link to the PLT Source Browser.

The PLT Scheme Source Search is the place to turn to if you have questions like:
  • Is define or lambda most popular?
  • Who has the most TODOs in his source?
  • Is there any profanity?
  • Where is an example of local-expand?
  • Where is set! abused?
  • Is the number 42 more popular than 13?
The development of the search engine is described in the blog posts below.

The next step in the project is to implement a disk-based lexicon. Right now only the index is disk-based, whereas the lexicon is an in-memory hash table.


Saturday, May 20, 2006

Selection of the top search results

A Google search for "Scheme" gives "Results 1 - 100 of about 356,000,000" in 0.40 seconds. Selecting these top 100 results could be done by sorting all 356,000,000 results by rank and returning the 100 greatest. However, a user will very seldom want to examine more than the top 500 results, so it is unnecessary to sort more than the top results.

The trick to find the n greatest elements of a list xs is to use a heap.

Begin by inserting the first n elements of the list into a heap. The elements of the heap should be thought of as the top n candidates.

For each remaining element x of the list we compare it with the minimum element of the heap. If x is smaller than the minimum element, then the n heap elements are all greater than x, and x is therefore not in the top n. If on the other hand x is larger than the minimum element, then x is inserted into the heap instead of the previous minimum element.

Now the heap contains the top-n elements, and by continually deleting the minimum element, we will get a sorted list of the top n elements.

Finding the top n elements out of m elements in this way takes time O(m log(n)). This is to be compared with O(m log(m)) for the naïve solution of sorting all m elements.

(require
 (only (planet "heap.scm" ("soegaard" "galore.plt" 2 0))
       insert find-min delete-min empty? list->heap)
 (lib "list.ss" "srfi" "1")  ; for take and drop
 (lib "67.ss" "srfi"))       ; the compare srfi

; selection : compare list integer -> list
;   return a sorted list of the n largest elements
;   of the given list - the list is assumed to
;   be of length n or longer
(define (selection compare xs n)
  (let ([cs (list->heap compare (take xs n))])
    (let loop ([cs cs]
               [xs (drop xs n)])
      (cond
        [(null? xs)
         (heap->sorted-list cs)]
        [(<? compare (car xs) (find-min cs))
         (loop cs (cdr xs))]
        [else
         (loop (insert (car xs)
                       (delete-min cs))
               (cdr xs))]))))

(define (heap->sorted-list h)
  (cond
    [(empty? h) '()]
    [else (cons (find-min h)
                (heap->sorted-list (delete-min h)))]))

> (selection integer-compare
             (list 7 1 3 9 5 6 4 8 2)
             4)
(6 7 8 9)


Sunday, May 07, 2006

The Search Engine Materializes

Having an index over Scheme source is no fun without a way to perform searches and present results in a human-friendly way. This weekend I therefore wrote a little web-servlet, which takes a query, looks up the terms in the index, and returns links to documents containing all the terms.

It didn't take many searches to realize that I need to spend some time on 1) ranking the results and 2) supporting both case sensitive and case insensitive queries.

In general ranking is difficult to get right; hopefully the narrow scope of indexing Scheme source only will help. Managing Gigabytes explains ranking in detail, and since I keep track of the term frequencies in the index, the groundwork has been done.

Supporting both case sensitive and insensitive searches with the same index can be done with a little trick: after tokenizing, all terms are converted to lower case before they are put in the index. When a search is made, the query is likewise converted to lower case. The returned document numbers can be used directly for a case insensitive search. To avoid false matches in a case sensitive search, the actual documents are retrieved from a repository and a simple-minded search for the original query terms is made.
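
In sketch form the trick looks like this. Here document-text is a hypothetical accessor that retrieves the text of a document from the repository; the real servlet naturally does more bookkeeping.

(require (lib "list.ss")) ; for filter

; normalize : string -> string
;   lower-case a term before indexing or querying
(define (normalize term)
  (list->string (map char-downcase (string->list term))))

; contains? : string string -> boolean
;   does the text contain the query verbatim?
(define (contains? text query)
  (let ([text-length  (string-length text)]
        [query-length (string-length query)])
    (let loop ([i 0])
      (cond
        [(> (+ i query-length) text-length) #f]
        [(string=? (substring text i (+ i query-length)) query) #t]
        [else (loop (+ i 1))]))))

; keep only the documents that really contain the query with the original case
; (document-text is a hypothetical helper)
(define (case-sensitive-filter query documents)
  (filter (lambda (document)
            (contains? (document-text document) query))
          documents))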

This approach makes sense when the indexes are large and the repository is available. My plan is to put the index on a web-server without the repository, so instead I plan to make two indexes, one for each search mode. Fortunately the code written so far is prepared for different indexes.

In lieu of an online web-servlet to try, I offer a screen shot:


Wednesday, April 26, 2006

Compression: Integer Encodings

Quick! How many bits are needed to store a natural number n?

Well, let's look at the standard encoding. One bit is needed for 1 (1). Two bits are needed for 2 (10) and 3 (11), and three bits are needed for 4 (100), ..., 7 (111). Hmm. log(1)=0, log(2)=1, log(4)=2, log(8)=3. So the number of bits needed is 1+floor(log n). Or is it?

Suppose the bits 110 are read from a file. Is this the number 6 or is it a 1 followed by a 2? We can't know unless we know a priori how many bits were used to store the number.

Leaving the standard encoding aside for a moment, consider instead the unary encoding.

n    unary
1    0
2    10
3    110
4    1110


The natural number is represented as n-1 1-bits followed by a single 0-bit.
The unary encoding does not have the same problem as before - it is a prefix-free code. A code for a number is never a prefix for the code of a different number.

The unary encoding is useful in situations where the majority of numbers are very small. For large numbers it is horribly inefficient.

The problem with the standard encoding was that we didn't know where the code ended. Consider now, for the sake of argument, encoding a number n as the unary code for the length of the standard encoding (1+floor(log n)) followed by the standard encoding itself. The total length of that code is 2*(1+floor(log n)). For large numbers this is much better than the unary code.

A small observation saves a bit. Consider the encoding of 4: 110 100. The encoding starts with the unary encoding of 3, the length of the standard encoding. Since all numbers of length 3 have the bit pattern 1xx, it is unnecessary to actually store that leading bit. Thus we can store 4 as just 110 00. This bit-saving encoding is called the gamma encoding.

To sum up, the gamma encoding of a natural number n consists of the unary code for 1+floor(log n) followed by the floor(log n) bits representing n-2^floor(log n) in binary. The number of bits used by the gamma code is 2*floor(log n)+1.

For numbers smaller than 5 the unary encoding uses no more bits than the gamma code, for 5 they use the same number, and for numbers larger than 5 the gamma code uses fewer bits than the unary code.


A variation of the gamma code is the delta code. The reasoning goes as follows: for numbers whose standard encoding is longer than 5 bits, it would be better to store the length as a gamma code than as a unary code. That is, the delta code for a natural number n consists of the gamma code for the length of the standard encoding of n, followed by the standard encoding.

For numbers below 32 the gamma code is shorter than the delta code. For numbers between 32 and 63 the codes have the same length. For 64 and larger numbers the delta code is shorter than the gamma code.

A small excerpt from the implementation of these encodings - mail me if you are interested in a copy. [The Blogger software inserts spurious newlines in the atom feed. See Everything Scheme for the original.]

  (require "../bit-io/bit-io.scm"
(planet "42.ss" ("soegaard" "srfi.plt")))

;;;
;;; UNARY CODE
;;;

; The unary code for an integer n>=1 is n-1 one bits followed by a zero bit.
; The code for 3 is 110.

(define write-unary
(case-lambda
[(n)
(write-unary n (current-output-bit-port))]
[(n out-bit-port)
(unless (and (integer? n) (positive? n))
(error #f "a positive integer was expected, got: " n))
(if (> n 1)
(write-bits (sub1 n)
(sub1 (arithmetic-shift 2 (sub1 (sub1 n))))
out-bit-port))
(write-bits 1 0 out-bit-port)]))

(define read-unary
(case-lambda
[()
(read-unary (current-input-bit-port))]
[(in-bit-port)
(do ([n 1 (+ n 1)])
[(= (read-bits 1 in-bit-port) 0)
n])]))
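
Following the same style, a gamma encoder can be built on top of write-unary and write-bits from the excerpt above. The sketch below is not part of the actual implementation - floor-log2 is a helper introduced here for illustration, and the real code may differ.

; floor of the base-2 logarithm of a positive integer (illustrative helper)
(define (floor-log2 n)
  (do ([n n (arithmetic-shift n -1)]
       [l -1 (add1 l)])
      [(zero? n) l]))

; write the gamma code: the unary code for 1+floor(log n)
; followed by the last floor(log n) bits of n
(define (write-gamma n out-bit-port)
  (let ([len (add1 (floor-log2 n))])
    (write-unary len out-bit-port)
    (when (> len 1)
      (write-bits (sub1 len)
                  (- n (arithmetic-shift 1 (sub1 len)))
                  out-bit-port))))

; write the delta code: the gamma code for the length of the
; standard encoding, followed by the standard encoding
(define (write-delta n out-bit-port)
  (let ([len (add1 (floor-log2 n))])
    (write-gamma len out-bit-port)
    (write-bits len n out-bit-port)))

For example, write-gamma writes 4 as 110 00, matching the example above.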


Friday, April 21, 2006

Reading and writing bits - bit-io.plt

As mentioned in "Index Construction" compression is used to store the index in a space efficient manner. The number of bits used to encode an integer varies according to its size. The standard file operations read and write bytes, so the need for a bit-level I/O library arose.

A little shopping around led to Oleg Kiselyov's page on Binary I/O. He presents a very nice solution to the problem of reading bit streams. The function make-bit-reader turns a byte stream into a bit stream. The byte stream is represented as a thunk that produces a new byte each time it is invoked. The returned bit-reader is represented as a function of one argument, the number of bits to read from the stream. The bits read are returned as an unsigned integer.

(define bit-reader (make-bit-reader (lambda () #b11000101)))
> (bit-reader 3)
6
> (bit-reader 4)
2

Inspired by this approach I wrote a bit-writer in the same style. Given a byte-writer, the function make-bit-writer returns two values: the first is a bit-writer, a function of two arguments - the number of bits to write and the actual bits; the second value is a bit-flusher. Since the bit-writer can't call the underlying byte-writer before a whole byte is received, it was necessary to introduce the bit-flusher to flush any remaining bits at the end of a file. The original code was optimized to handle common cases, such as reading a single bit or a whole byte, fast. An effort was made to mimic these optimizations in the writer.
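
The following sketch shows only the buffering idea behind such a bit-writer, without the optimizations. It assumes the byte-writer is a one-argument procedure and is not the actual bit-io.plt code.

; a sketch of the buffering idea - not the bit-io.plt implementation
(define (make-bit-writer byte-writer)
  (let ([buffer 0]   ; the bits received so far, most significant bit first
        [count  0])  ; the number of bits in the buffer
    (define (bit-writer n bits)
      (set! buffer (+ (arithmetic-shift buffer n) bits))
      (set! count (+ count n))
      ; pass a byte on each time eight or more bits are buffered
      (let loop ()
        (when (>= count 8)
          (byte-writer (arithmetic-shift buffer (- 8 count)))
          (set! buffer (bitwise-and buffer
                                    (sub1 (arithmetic-shift 1 (- count 8)))))
          (set! count (- count 8))
          (loop))))
    (define (bit-flusher)
      (when (> count 0)
        ; pad the remaining bits with zeros up to a whole byte
        (byte-writer (arithmetic-shift buffer (- 8 count)))
        (set! buffer 0)
        (set! count 0)))
    (values bit-writer bit-flusher)))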

Since the first version of the indexer used normal file operations such as with-output-to-file, write-byte and friends, it wasn't straightforward to change the code to use the stream approach. The solution was to introduce bit-ports, which resemble normal ports. In the following examples the numbers from 1 to 8 are written to a temporary file and then read back. Each number is written using a different number of bits.

> (require (planet "bit-io.scm" ("soegaard" "bit-io.plt")))

> (with-output-to-bit-file "tmp"
    (lambda ()
      (do ([i 1 (+ i 1)]) [(= i 9) 'done]
        (write-bits i i)))
    'replace)
done

> (with-input-from-bit-file "tmp"
    (lambda ()
      (do ([i 1 (+ i 1)]
           [ns '() (cons (read-bits i) ns)])
          [(= i 9) ns])))
(8 7 6 5 4 3 2 1)



The bit-io library is available through PLaneT; the documentation and source are available at the PLT Source Browser.


Sunday, April 16, 2006

Index Construction

Update: The last part was rewritten.

Sort-based index construction

The implementation of the indexer has progressed better than I expected. As a warmup to the implementation of the "sort-based multiway compressed" algorithm, I have implemented both the "sort-based" and the "sort-based compressed" algorithms. They both work in four phases:
  1. For all terms in all documents a record (t, d, f) of the term number t, the document number d and the frequency f of the term t in the document d is written to a temporary file.

  2. The temporary file is divided into blocks, which are sorted in-memory.

  3. A series of runs are made. Each run merges two sorted blocks. The end result is a large file consisting of (t,d,f) triples sorted in lexicographic order.

  4. Based on the (t,d,f) triples the index is constructed. For each term number t an inverted list of the form (t n d1 f1 d2 f2 ... dn fn) is written to the index file. Here n is the number of documents in which the term occurs; the di and fi are the associated document numbers and frequencies.
During the tokenizing in phase 1 a lexicon is made. The lexicon associates terms and term numbers. Furthermore, during phase 4 the location of each term's inverted list in the index file is saved in the lexicon.
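
As a sketch of phase 4: here the sorted triples are assumed, for illustration only, to be given as a list of (t d f) lists, and write-inverted-list is a hypothetical helper that writes one entry to the index file. The real indexer streams the triples from disk instead.

; write-index : (list-of (list t d f)) -> void
;   the triples are assumed to be sorted lexicographically;
;   write-inverted-list is a hypothetical helper
(define (write-index triples)
  (unless (null? triples)
    (let loop ([t (caar triples)] [entry '()] [triples triples])
      (cond
        [(null? triples)
         (write-inverted-list t (reverse entry))]
        [(= (caar triples) t)
         ; collect the (d f) pairs belonging to the current term
         (loop t (cons (cdar triples) entry) (cdr triples))]
        [else
         (write-inverted-list t (reverse entry))
         (loop (caar triples) '() triples)]))))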

Searching for a term is simple: use the lexicon to look up the location of the inverted list in the index file, read the inverted list, and present the result - the frequencies are used to rank the documents by relevance.
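
In code the lookup might look roughly like this, where lexicon-ref and read-inverted-list are hypothetical helpers: the first returns the file position of a term's inverted list (or #f), the second reads one inverted list from an input port.

; search-term : lexicon input-port string -> inverted-list or #f
;   (lexicon-ref and read-inverted-list are hypothetical helpers)
(define (search-term lexicon index-port term)
  (let ([location (lexicon-ref lexicon term)])
    (and location
         (begin
           (file-position index-port location)  ; seek to the inverted list
           (read-inverted-list index-port)))))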

To keep the size of the index and the temporary file down, the numbers aren't stored as standard 32 bit integers. Since an integer N requires only 1+floor(log2(N)) bits, 32 bit integers waste a lot of bits on small numbers. Instead more flexible encodings are used, in which small numbers are represented in fewer bits than larger numbers [more about bit encodings in a later post].

Sort-based index construction with compression

A standard compression trick is to use gaps to store ascending lists. Say a term is found in documents (2 6 9 13); then the corresponding gaps are (2 4 3 4). The original document numbers are recovered by adding the gaps, e.g. 9 = 2+4+3. The important thing to note is that the gaps are smaller than the original numbers and thus can be stored in fewer bits (if an appropriate encoding is chosen).
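
For illustration, converting between document numbers and gaps can be done like this:

; numbers->gaps : (list-of integer) -> (list-of integer)
(define (numbers->gaps ds)
  (let loop ([previous 0] [ds ds])
    (if (null? ds)
        '()
        (cons (- (car ds) previous)
              (loop (car ds) (cdr ds))))))

; gaps->numbers : (list-of integer) -> (list-of integer)
(define (gaps->numbers gs)
  (let loop ([previous 0] [gs gs])
    (if (null? gs)
        '()
        (let ([d (+ previous (car gs))])
          (cons d (loop d (cdr gs)))))))

; > (numbers->gaps '(2 6 9 13))
; (2 4 3 4)
; > (gaps->numbers '(2 4 3 4))
; (2 6 9 13)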

This observation can be used to store the inverted lists, since the inverted list of a term (t n d1 f1 d2 f2 ... dn fn) has ascending document numbers di. That is, changing the representation of an inverted list to (t n dg1 f1 dg2 f2 ... dgn fn), where dg stands for document gap, will reduce the size of the final index. The alert reader will note that the same trick applies to the term numbers in the final index. The extra alert reader will note that all terms are present in the final index and each term occurs only once, so we can omit the term number t completely from the final index.

Compressing the final index file will have relatively little effect on the total run time of the indexer, since the bulk of the work is done in phase 3 during the external sorting of the temporary files consisting of the (t, d, f) triples. Compressing these temporary files, however, would reduce I/O considerably and hence also the run time. Can we use the same trick again? We sure can - the temporary files consist of blocks of triples, and within each block the term numbers t occur in ascending order. The only complication of changing the representation from (t, d, f) to (tg, d, f), where tg is the term number gap, is some additional bookkeeping during merging.

The difference between the "sort-based" and the "sort-based compressed" algorithms is thus that the latter compresses the temporary files.

One last observation: by fusing phases 1 and 2, a pass through the temporary file is saved.
