🤖 uh-halp-data

Over the Christmas break I started working on a dataset to train a small model that can be used to run uh-halp offline.

Putting it mildly, this was a bit of a slog.

💤 tl;dr

📺 uh what?

uh is a command line tool that tells you what to type on the command line, for people who are old and forgetful like me. A picture is worth 0x1000 DWORDS, and a video even more.

With that in mind, here’s an example:

demo

Problem is, the thing needs to send your queries to an external service, which means having an account and a key and a config and an Internet connection and so on and so on and so on…. Which I’m not a big fan of.

If your freedom to use a program is mediated by an LLM provider then it can’t really be called free software. But I recently got myself an Nvidia Orin and can run language models now. At least if I’ve got a connection to the thing and ollama is running, which users might not have.

And llama can create data… so, why not have a stab at making a model that can run offline, on machines with minimal compute and space resources?

Sounds like a fun project, right? This view aged like warm milk.

💽 the data

The app takes questions and produces answers like this:

Q: how much space is left here
A: df -h .
Q: how many files are in here
A: find . -type f | wc -l
Q: ssh me@box but forward X so I can do gui stuff
A: ssh -X me@box

So training data needs to be in this format.


🦣 0. A data collection pipeline for mammoths

And so, I decided to embark on a gigantosauric data collection exercise. One of many maybe, or 0.5 of 0.5 if I never actually finish it.

I drive the process with a Makefile, which has target data files as outputs and scripts as inputs. This, as usual, is a mixed blessing. Having suffered this process a couple of times now, I might not do it this way again.

Or maybe I will because I prefer to gloat about other people’s mistakes than learn from my own.
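For a rough idea of the shape, here's a minimal sketch with made-up target and script names (not the real file):

```make
# Data files are targets, scripts are prerequisites: touch a script and make
# reruns that step plus everything downstream. All names here are illustrative.
data/binaries.txt: scripts/list-binaries.sh
	./scripts/list-binaries.sh > $@

data/ranked.txt: scripts/rank-binaries.sh data/binaries.txt
	./scripts/rank-binaries.sh data/binaries.txt > $@

data/helps/.done: scripts/extract-helps.sh data/ranked.txt
	./scripts/extract-helps.sh data/ranked.txt && touch $@
```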

The approach looks like this:

setup

🤦 mental challengings


📦 1. ALL THE BINARIES

First up, we need to know all the commands that exist. This impossible task is somewhat possible in Ubuntu as we can install apt-file, run apt-file update then grep its lz4 archives for stuff in /bin and /usr/bin.

That doesn’t give us everything, but if you add in /etc/alternatives and all the default /sbin and the shell builtins then you’ve got a pretty solid starting point.
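In rough shell terms it looks like this (a sketch, assuming the apt Contents indexes land under /var/lib/apt/lists/ as .lz4 files; paths and patterns are my guesses, not the actual script):

```bash
# Sketch only: enumerate candidate commands from the Contents indexes plus
# whatever the base image and the shell already provide.
apt-get update && apt-get install -y apt-file lz4
apt-file update

# Package-provided files under bin/ and usr/bin/ (Contents paths have no
# leading slash), reduced to bare command names.
lz4cat /var/lib/apt/lists/*Contents*.lz4 \
  | awk '{print $1}' \
  | grep -E '^(usr/)?bin/[^/]+$' \
  | xargs -rn1 basename \
  | sort -u > binaries.txt

# Add the defaults that aren't in any package list: sbin, alternatives,
# and the shell builtins.
for d in /bin /usr/bin /sbin /usr/sbin /etc/alternatives; do ls "$d"; done >> binaries.txt
compgen -b >> binaries.txt
sort -u -o binaries.txt binaries.txt
```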

uh-halp supports Windows too, but I’ll eat that can of worms when I have the stomach for it. It’s not like you can just do pip install uh-halp and the thing appears on your path in Windows, because python is buried in an ever-moving dir. Bronchitis limits the amount of time somebody has got for Windows anyway.

So just Ubuntu packages for now. Using Docker. Easy-peasy:

step 1


📈 2. Popularity contest

Then we need to know which ones people are actually likely to type into a computer, and in turn are likely to need help running.

One way I could do this is by pulling all the .sh files I can get my hands on, or running strings on the entire world, and tallying them up. Or I could carefully parse the package list and look at deps of deps, and see which ones are used that way. Or I could type in as many commands as I can remember - a list that’s getting shorter by the day.

But I’ve got a shiny new AI box thing, and fifty quid still sat on vast.ai burning a hole in my account. So why not get llama3 to do it for me?

So that’s what I did. I told the language model a white lie and had it sort the names of programs by likelihood of being typed into a terminal:

You are bash-cache-priority, an AI program that decides which commands the user
is most likely to type. Input is a list of binaries. Output is an ordered list
by likelihood of being entered on the keyboard. Respond with just the command
names in order of likelihood.

And pass it a list of 10 random commands. And then the next 10 and so on.

Once it’s ranked each group, I give each command a score based on its position in the list. Then I sort by mean score and repeat the process on the top half. Using a language model as a sort function is a pretty cool trick.
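Roughly, the loop looks like this (a sketch, assuming ollama is serving llama3 and that the model replies with nothing but the command names; the file names are made up):

```bash
#!/usr/bin/env bash
# Sketch of the "LLM as a sort function" pass.
set -uo pipefail

PROMPT='You are bash-cache-priority, an AI program that decides which commands
the user is most likely to type. Input is a list of binaries. Output is an ordered
list by likelihood of being entered on the keyboard. Respond with just the command
names in order of likelihood.'

shuf binaries.txt | xargs -n 10 | while read -r batch; do
  printf '%s\n\n%s\n' "$PROMPT" "$batch" \
    | ollama run llama3 \
    | grep -oE '[A-Za-z0-9._+-]+' \
    | nl -ba \
    | awk '{print $2, $1}'          # command, rank within its batch of 10
done > scores.txt

# Lower mean rank = more likely to be typed; sort, keep the top half, repeat.
awk '{sum[$1] += $2; n[$1]++} END {for (c in sum) print sum[c] / n[c], c}' scores.txt \
  | sort -n > ranked.txt
```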

step 2

🤦 shouldawouldacoulda


🐋 3. ALL THE PACKAGES

Now we just need to install each of the packages, right? Should be simple enough, as long as we’re sensible and only install the few hundred that matter…

Turns out the datahoarder in me wouldn’t allow that, so I ran it to completion for aarch64 and x86_64. Which cost a bit of disk space…

REPOSITORY                        TAG                    CREATED        SIZE
uh-halp-data-binaries             ubuntu-13000-aarch64   2 hours ago    201GB
uh-halp-data-binaries             ubuntu-12500-aarch64   5 hours ago    194GB
uh-halp-data-binaries             ubuntu-12000-aarch64   7 hours ago    188GB
...
uh-halp-data-binaries             ubuntu-1000-aarch64    30 hours ago   27.7GB
uh-halp-data-binaries             ubuntu-500-aarch64     31 hours ago   17.7GB
uh-halp-data-binaries             ubuntu-base-aarch64    32 hours ago   1.46GB
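The tags suggest chunks of 500 packages at a time, so a reconstruction of the build loop might look something like this (my guess at the shape, not the actual scripts; packages-ranked.txt is a hypothetical ranked list of package names):

```bash
#!/usr/bin/env bash
# Install the ranked packages 500 at a time, tagging an image per chunk so a
# broken package doesn't torpedo the whole run.
set -uo pipefail

prev="uh-halp-data-binaries:ubuntu-base-aarch64"
count=0

xargs -n 500 < packages-ranked.txt | while read -r chunk; do
  count=$((count + 500))
  tag="uh-halp-data-binaries:ubuntu-${count}-aarch64"

  printf 'FROM %s\nRUN apt-get update && DEBIAN_FRONTEND=noninteractive \\\n    apt-get install -y --no-install-recommends %s || true\n' \
    "$prev" "$chunk" > Dockerfile.chunk

  # Use a small build context in practice; "." here is just for the sketch.
  docker build -t "$tag" -f Dockerfile.chunk .
  prev="$tag"
done
```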

I pushed these up to Docker Hub, but I imagine they’ll get culled for using a piss-taking amount of space.

Here they are though:

step 3

🤦 oofs


🛟 4. ALL THE HELPS

Next step is extracting the --help for each program. Simple, just call it for each program and save the output, right?

Yeah, right.

With 40,000 programs to run, you run into a lot of badly behaved ones. Ones that existed 60 years ago, ones that expect people in their problem domain to put up with certain quirks. There’s a lot of variation.

So I create a separate dir for each program, run it with a 1 second timeout, and kill the thing if it takes too long. Then run it with -h if --help failed. And extract the manpages too.

If the outputs are too large, I remove the dir instead of copying it out of the container. Then it was simply a matter of figuring out which of the three files contains the most useful information, but not too much of it, as it needs to be passed to an LLM for processing and still leave some context space for answers.
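The harvesting loop, roughly (a simplified sketch; the size cutoff and paths are mine, and the real thing runs inside the container):

```bash
#!/usr/bin/env bash
# Per-command help harvest: try --help, fall back to -h, and grab the manpage.
set -uo pipefail

mkdir -p helps
while read -r cmd; do
  out="helps/$cmd"
  mkdir -p "$out"

  # 1 second timeout, then SIGKILL; fall back to -h if --help got us nothing.
  timeout -k 1 1 "$cmd" --help < /dev/null > "$out/help.txt" 2>&1 \
    || timeout -k 1 1 "$cmd" -h < /dev/null > "$out/h.txt" 2>&1

  man "$cmd" 2>/dev/null | col -bx > "$out/man.txt"

  # Anything suspiciously large gets dropped rather than copied out;
  # 64k is an arbitrary cutoff for the sketch.
  find "$out" -type f \( -size +64k -o -empty \) -delete
  rmdir "$out" 2>/dev/null || true
done < binaries.txt
```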

The outputs of the help generation steps are here:

step 4

🤦 gahwtfkinshii


🐋 5. ALL THE TIME IN THE WORLD

Because the Docker images are too large to actually manage, I dumped the contents out, reset all the atimes, mounted the root dirs, and re-ran the help extractor. Then I removed all the files that weren’t accessed, ran UPX over all the binaries, and rebuilt the image.
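My reconstruction of that pass looks something like this (a sketch, not the exact script; it assumes the dumped rootfs is mounted somewhere that records atimes, and the paths and image tag are made up):

```bash
#!/usr/bin/env bash
set -uo pipefail

ROOT=/mnt/rootfs   # hypothetical location of the dumped image contents

# Push every atime into the distant past so even relatime will bump it on read.
find "$ROOT" -type f -exec touch -a -d '2000-01-01' {} +
touch /tmp/stamp

./extract-helps.sh "$ROOT"     # hypothetical re-run of the step 4 extractor

# Anything whose atime never moved past the stamp was never read; drop it.
find "$ROOT" -type f ! -anewer /tmp/stamp -delete

# Squash the survivors; upx grumbles about scripts and odd formats, which is fine.
find "$ROOT" -type f -executable -exec upx -q --best {} + || true

# Re-import the slimmed tree as a fresh image.
tar -C "$ROOT" -c . | docker import - uh-halp-data-binaries:ubuntu-slim-aarch64
```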

This brings the 208GB image down to a more respectable 13GB compressed, 24GB when installed. 🎉

Why not use docker-slim like a reasonable human being? Well, it failed on the base image and I didn’t want to wait for it to run for the larger ones.

step 5

🤦 snafu very much


📃 6. All the ways to do something

Early on during testing, it became pretty clear that llama, when asked for usage scenarios and given the manpages, would just regurgitate what’s in the manual rather than come up with scenarios a user would actually have.

So I needed to steer it: first have it generate user scenarios based on the program’s help, then have it generate questions within that context.

As an extra bonus, I want it to at least be aware of adjacent commands too. So here’s the prompt that survived a few rounds of testing:

There is a program called "$command_name". It performs the following function:

$help_text

We want to understand how this program is used in real-world scenarios.
Generate:
1. A one-line summary of the program's purpose.
2. List other commands that are frequently used with it, or are related.
3. A list of 10 realistic use cases where humans commonly use this program.
   Each use case should:
   - Be unique and practical
   - Focus on real-world tasks, ones that can be solved by typing the command
     name in.
   - Describe a reason why you'd type the program into the console.
   - Focus mainly on common scenarios, rather than exotic or speculative ones.

Here’s an example for the `ls` command:

**COMMAND NAME:** ls  
**SUMMARY:** Lists the contents of directories.  
**RELATED:**
* find, xargs, file, stat, du, sort
**USE:**
1. I want to see what files are in this directory.
2. Which file in here is the biggest?
3. Do any of these files have broken permissions?
4. Are there .txt files in this dir? 
5. Which one of these files just got written to?
6. How many files are in here?
7. What hidden files are in here?
8. Combine with `grep` to find specific filenames in a directory.
9. Pass these files to `xargs` and pass to another program.
10. Save this list of files for later.

Now, generate outputs for the following command:

**COMMAND NAME:** $command_name

And let it generate the scenarios.
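Wiring the template up can be as simple as this (a sketch; scenario-prompt.txt is the prompt above with $command_name and $help_text left as placeholders, and the paths are made up):

```bash
#!/usr/bin/env bash
# Fill the template for one command and fire it at the model.
set -uo pipefail

export command_name="$1"
export help_text="$(cat "helps/$1/help.txt")"

mkdir -p scenarios
envsubst < scenario-prompt.txt | ollama run llama3 > "scenarios/$1.md"
```

Run that once per command from the popularity list and you end up with one scenario file per command.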

step 6


👨‍🦰 7. uh gen

And finally, now we have some data that we can use to generate user scenarios. The primary user being me, of course.

There’s a few different scenarios I generally use:

A full manpage extract would provide data for the last scenario, but that can come later. Knowing the flags to every program in the Universe and how to install them would help a lot with the rest.

The prompt I settled on was to have the LLM act like a user in the above situations.

🚂 8. Training

And this is where I ran out of time, get back to work!

To be continued!