Today is the anniversary of the transfer of Alaska from the Russian Empire to the United States. Yes, that’s the best holiday that I could find. It’s less depressing than our taking over Puerto Rico.

Alaskan flag

I like the flag, at least.

Anyway, onward.

INTERN

The headline for this week is probably the optimization work. As I expected last week, the rusqlite crate doesn’t natively have a bulk-insert feature, so I needed to write code to generate the necessary SQL like this.

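use rusqlite::{params_from_iter, Connection};

// Insert every tuple with a single multi-row INSERT statement.
// IndexTuple is INTERN's own row type (file, stem, offset, word).
fn insert_bulk_word_tuples(
    sqlite: &Connection,
    words: Vec<IndexTuple>,
) {
    // An INSERT with an empty VALUES list is a syntax error, so skip it.
    if words.is_empty() {
        return;
    }

    // One "(?,?,?,?)" group per row, joined into the VALUES clause.
    let placeholders = words.iter().map(|_| "(?,?,?,?)").collect::<Vec<_>>().join(", ");
    let query = format!(
        "INSERT INTO file_reverse_index (file,stem,offset,word) VALUES {}", placeholders
    );
    let mut values = Vec::<String>::new();

    // Flatten each row into four strings, in the same order as the columns.
    for word in words {
        values.push(word.file.to_string());
        values.push(word.stem.to_string());
        values.push(word.offset.to_string());
        values.push(word.word);
    }

    // Bind the flattened values to the placeholders and run the statement.
    sqlite.execute(&query, params_from_iter(values.iter())).unwrap();
}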

The key line is the assignment of placeholders, which maps the input list to a series of templates for the library to fill in when we call .execute(). In between, we turn the input values into a flat list of strings, since the conversion to a parameter array (params_from_iter()) only works on a homogeneous list of primitive values, as far as I can tell.
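To make the placeholder step concrete, here is a minimal sketch of just the string-building part, using only the standard library. The helper name bulk_insert_sql is hypothetical (it isn't part of INTERN); the table and column names are the ones from the function above.

```rust
/// Build a multi-row INSERT with one "(?,?,?,?)" group per row.
/// `bulk_insert_sql` is a hypothetical helper used only for illustration.
/// Note that `row_count == 0` produces an empty VALUES list, which is not
/// valid SQL, so a caller would need to skip empty batches.
fn bulk_insert_sql(row_count: usize) -> String {
    // Repeat the four-parameter template once per row.
    let placeholders = vec!["(?,?,?,?)"; row_count].join(", ");
    format!(
        "INSERT INTO file_reverse_index (file,stem,offset,word) VALUES {}",
        placeholders
    )
}
```

For two rows, this should produce a statement ending in VALUES (?,?,?,?), (?,?,?,?), which then needs exactly eight bound values, in row order.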

I say that this is the headline, because indexing my test folder has gone from more than a two-minute process to a one-second process, which is far more reasonable. And really, nothing else I did has nearly the emotional impact in text of “increased speed by a factor of more than one hundred.” It could probably be improved further, by batching together files until some maximum size that would otherwise require another segment of memory allocated. But that’s going to overcomplicate the code—including creating a value that may change over time, as memory becomes cheaper and it’s more expedient to address it in larger units—to speed up a task that only occurs once by…maybe a factor of ten.

However, some of this overcomplication already needed to happen, because either SQLite or the library can’t handle more than 32,768 (2^15) arguments for a single INSERT statement. This means that the code needs to break up indexing, which has four arguments per row, into batches of 8,192 (2^13) words at a time, which is the latter half of the optimization. That means that the insert function already breaks a list down into right-sized batches, so if INTERN suddenly processes multiple files at a time and sends the results to the function, it’ll be able to handle that.
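The batching arithmetic can be sketched roughly like this, again with only the standard library. The names here (MAX_PARAMS, PARAMS_PER_ROW, for_each_batch) are hypothetical, and the 32,768 figure is simply the limit I observed, not a number checked against the SQLite documentation.

```rust
/// Parameter limit as observed in the post (an assumption, not a documented value).
const MAX_PARAMS: usize = 32_768;
/// Each indexed word binds four values: file, stem, offset, word.
const PARAMS_PER_ROW: usize = 4;
/// 32,768 / 4 = 8,192 rows per INSERT statement.
const MAX_ROWS_PER_BATCH: usize = MAX_PARAMS / PARAMS_PER_ROW;

/// Hypothetical helper: hand `rows` to `insert` in batches that stay
/// at or under the row limit. The last batch may be smaller.
fn for_each_batch<T>(rows: &[T], mut insert: impl FnMut(&[T])) {
    for batch in rows.chunks(MAX_ROWS_PER_BATCH) {
        insert(batch);
    }
}
```

Because the splitting happens on whatever slice arrives, it shouldn't matter whether the rows came from one file or several.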

It’s a much smaller optimization in terms of time saved, mostly allowing the application to make more sense, but INTERN also now respects the structure of git and Mercurial repositories, so that repository metadata and any files not checked in should mostly be ignored. While that’s not necessarily much faster, it does mean that garbage files won’t show up in searches…or maybe it doesn’t, just yet. Tests show occasional errors that I’m working on running down.
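As a very rough sketch of the simplest part of that idea, the following checks whether a path sits inside a repository's metadata directory. It is not INTERN's actual logic: is_repo_metadata is a hypothetical name, and it deliberately ignores the harder question of which files are checked in.

```rust
use std::ffi::OsStr;
use std::path::{Component, Path};

/// Hypothetical helper: true when any component of `path` is exactly
/// `.git` or `.hg`, meaning the path points into repository metadata.
/// It does not consult ignore rules or the repository's list of tracked files.
fn is_repo_metadata(path: &Path) -> bool {
    path.components().any(|part| match part {
        Component::Normal(name) => {
            name == OsStr::new(".git") || name == OsStr::new(".hg")
        }
        _ => false,
    })
}
```

A file such as project/.git/config would be skipped, while project/.gitignore would not, since only whole path components are compared.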

There’s also the start of a socket-based API that listens on (TCP) port 48813 and responds—if not necessarily in a useful way—to input. While not yet useful, it should be easy to expand into a server that returns search results.
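For a sense of scale, a listener of this kind can be sketched with the standard library alone. Everything beyond the port number is an assumption made for illustration: the loopback address, the line-based exchange, the "received:" reply, and the names bind_default, handle_client, and serve.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};

/// Port mentioned in the post.
const INTERN_PORT: u16 = 48813;

/// Bind to the loopback interface (an assumption) on the post's port.
fn bind_default() -> std::io::Result<TcpListener> {
    TcpListener::bind(("127.0.0.1", INTERN_PORT))
}

/// Placeholder handler: read one line and acknowledge it.
/// A real search API would parse the line as a query instead.
fn handle_client(stream: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut writer = stream;
    let mut line = String::new();
    reader.read_line(&mut line)?;
    writeln!(writer, "received: {}", line.trim_end())?;
    Ok(())
}

/// Serve up to `max_clients` connections, one at a time, then return.
fn serve(listener: TcpListener, max_clients: usize) -> std::io::Result<()> {
    for stream in listener.incoming().take(max_clients) {
        handle_client(stream?)?;
    }
    Ok(())
}
```

Swapping the acknowledgment in handle_client for a lookup against the index would be the natural next step toward returning search results.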

Doritís Onomáton

There aren’t changes to be pushed to the Doritís Onomáton repository for it, yet, but I finally managed to make some time—including closing unnecessary applications to keep my rickety laptop from crashing—to test the app on my Android tablet.

That testing turned up a significant bug: I may be misinterpreting the logs, but it appears that it doesn’t save the API key or isn’t sending it to the server…even though it does both in the browser.

Next

For this week, I’ll soldier on with INTERN. Besides working out the issue with the un-watching process, adding a search API is the top priority, as well as setting up a test client. Time permitting, I may also look at processing web history, since what I’ve read is arguably more important to search through than my notes.

And at some point (maybe this week, if I get INTERN’s basic features done, but more likely later), I will probably need to review and rewrite the local storage code for Doritís Onomáton…again.


Credits: The header image is Flag of Alaska by an anonymous Open Clip Art artist, released into the public domain.