While not the big deal that it once was, today is National Coming Out Day, part of a political movement that says—sometimes rightly, sometimes not—that it’s a critical for people in gender and sexual minorities to tell their loved ones and colleagues. The premise is that it’s much harder to hate or want to change a category of people that includes people who make your life better.
In other words, it’s not just a necessary act of courage to find out if your loved ones support you, but also a radical and fundamental kind of activism.
Today is also the International Day of the Girl Child—which I assume is a celebration of being overly verbose—International Newspaper Carrier Day for people who remember newspapers, Columbus Day, and the assortment of holidays that should replace Columbus Day, since Christopher Columbus was a complete jackass by the standards of his time. We should always remember—especially in this time when people will clutch their pearls over the dangers of canceling Columbus—that Columbus is considered an American hero, only because all the decent explorers were taken , and nobody else wanted to be associated with the creep.
And on that note…code.
As suggested in the last developer journal post, the week was dedicated to getting my INTERN on-boarded. While working in Rust has its occasional off-putting moments—the rules for when a function’s return value can be used immediately in an expression and when it needs to be stored in a temporary value feel completely arbitrary to me, and sometimes “crate” (library) documentation is useless—the language has developed a strong ecosystem, and my work has progressed fairly quickly.
So far, I’ve been able to get notifications on modified files, configure the program based on a settings file, and store the last known update and a primitive inverted index into an SQLite database stored with the settings.
Why am I using notifications? It’s probably viable to just re-scan all the files overnight, when I’m not using the computer. However, I wanted to play with the library. And yes, I’m also the sort of person who will write three paragraphs about something, then completely forget that I worked on it. After all, I have the notes, so I don’t need to remember it. So, this should help.
I opted to skip the file type analysis, for now. That may be something to regret, when it comes to searching for words that happen to also be keys in structured data files—like “title”—but if I need to delete the entire database to start over, that only loses me a few seconds. I did, however, decide to use a stemming library to generalize searching, but to also store the original word in the index, in case I want it later, to look for synonyms or something along those lines.
The data looks something like this random sampling, for those interested. We have the files.
The “modified” time is seconds since the UNIX epoch (GMT), so that file was last changed on Wednesday night—October 5th, which probably shouldn’t be surprising—my time. Then, we have our word stems.
You’ll notice that most short words are identical to their stems, but others are truncated in strange ways. We don’t actually care what the actual stem is, because the user should never see it. Rather, we just want related words to have related stems.
|ID||file ID||stem ID||word offset||word|
The last item, here, shows the value in having the stem. Use, used, uses, using, useful, and usefully are all given the same stem. I might not remember how I used a word—singular or plural, verb tense, and so on—so this helps catch the possibilities. But I also keep the original word, so that I can (eventually) prioritize results that used the same words used.
That last advantage also goes for the word offset. Generally, we don’t remember random words in a document; we remember phrases or at least related words. So, the offsets will allow the search results to assume that clustered words are more relevant than scattered words.
That’s not too bad, considering that the project is only eight days old.
For this week, a major goal is going to be to refactor INTERN, which has already started. It’s only two thousand lines of code, but it feels like I’ve already duplicated vast swaths of the logic that could easily be generalized and made more readable and probably significantly less fragile. Once that hits a stable point, I’ll work on adding a basic API, so that I can actually use the index that I’ve created. Presumably, with the API, this will also need at least a command-line client. A web interface might be overkill—and counter-productive, given that many files need specific applications to be easily read—but I’ll at least give it some thought.
Another useful idea would be to version the database. After all, eventually INTERN is going to need a new table or column, and therefore needs to know what updates to run. Whether that’s just a table that’s monitored in code or brings in an object-relational mapper with migrations, such as Diesel, is worth researching before deciding. Probably the latter, since as long as the project is in Rust, it might as well be used to explore the ecosystem.
It will require deleting and rebuilding the index—which takes time, but isn’t that long a task—but it might also be smart to find content to exclude. For example, the database averages almost three times the size of the files that it cares about, since it contains an entry for each word, plus an entry for the stem and the file. One way to limit this growth would be to skip any stop words, words that are too common or too structural to convey meaning in a search; in the sample data above, the word “that” probably doesn’t belong in the database at all. Similarly, most of my writing is under version control, meaning that I should abide by the “ignore” files and avoid the repository metadata. Currently, INTERN does neither, and the metadata can often make up half the size of the repository, and those are certainly never the files that I’m looking for.
There are also time-based optimizations to make. For example, while I haven’t investigated, there is probably a way for the library to insert multiple rows into a table at once—it’s possible in SQL, after all, so at worst it should be possible to dynamically generate the statement—so that I can collect the words and stems in memory, rather than hitting the database hundreds of times per file. Likewise, rather than deleting and rebuilding the file’s index every time we detect a change, I could compare the index with the file and make any necessary changes. In addition to speeding up that process, it would also allow me to search for deleted words, which might or might not still be under version control, but still might be what I remember from the file. You might point out that deleted words would bloat the database, but it’s pretty rare that I delete large amounts of text, especially in anything that I consider to be “notes.” I’d rather have dated updates, if something turns out to be wrong.
Reports are probably on the agenda, at some point, too. Especially if I track any deleted words, it opens the door to tracking the changed word count over time, sentiment analysis, times of activity bursts during the day, words that I maybe overuse, and so forth. I did similar work on SlackBackup, if you want a sense of what that could look like. I can’t recommend the use of the backup program anymore, since the version of Electron is years out of date and Slack is trying to eliminate their token-based API, but I still consider it a decent example of wringing useful data out of idle thoughts…though I have to admit that I dropped the ball in making it clear that sentiment analysis alone isn’t nearly as interesting as the change in sentiment over time. Individual sentiment values could indicate emotion or merely accustomed vocabulary and grammatical structures, but a change suggests either deliberate action (“trying to sound more positive”) or a subconscious change.
Anyway, mostly the refactoring for the next few days, but this should give some indication of where the project might go.
Credits: The header image is Walk In Closet — Expandable Closet Rod and Shelf by Wjablow, made available under the terms of the Creative Commons Attribution-Share Alike 3.0 Unported license.
Tags: programming project devjournal