Elizabeth’s RC blog

Day Twenty-Five - Introducing my language data pipeline project

2024-02-06T00:00:00+00:00

It’s about time for me to introduce the main project that I’ve been working on at the Recurse Center. This is a pipeline for collecting annotated text data for AI training and benchmarking. I’m working on this because it’s a fairly complex real-world process, it involves an area of my expertise (language data), and I’ll get to learn hands-on about orchestrators and other data engineering tools.

Data collection process

I’m still exploring data engineering, but I get the impression that in the most typical case, you have large pre-existing data sources (like a database of customer transactions on your e-commerce website), and your job is to transform them into information that will be helpful for your business in some way.

That is markedly different from the AI/ML data collection process I describe here, in which your goal is to create small hand-crafted batches of annotated data that meet a quality threshold. Here are some characteristics of this process:

The datasets are fairly small (on the order of 1000s to 10,000s of rows)
They contain natural language data: either written by humans, retrieved from an existing source, or generated by LLMs
Each row is accompanied by metadata (inherent to the row) and labels (to be provided by annotators)
The datasets are usually divided into smaller batches for gradual/incremental data delivery
These datasets must reach a threshold on one or more QA metrics (inter-annotator agreement (IAA) measures such as Cohen’s kappa or Krippendorff’s alpha; precision/recall/F1; etc.). They often undergo multiple iterations of annotation to reach the desired quality.
Datasets are authored and annotated by what I will call “annotation resources”. Examples of these could be data vendors, crowdsourcing platforms, an internal linguist team, or even an automated LLM labeling process. Multiple annotation resources may work on a single project.
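
To make the agreement metrics above concrete, here is a minimal sketch of Cohen’s kappa for two annotators in plain Python. The function name `cohens_kappa` and the toy labels are made up for illustration. Returning 1.0 when both annotators only ever use one identical label is a convention chosen for this sketch, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same rows."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of rows")

    n = len(labels_a)

    # Observed agreement: fraction of rows where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Degenerate case (assumption of this sketch): kappa is undefined,
        # so treat total agreement on a single label as 1.0.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four rows.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

A batch would then pass QA only if this value is at or above whatever threshold the project has agreed on.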

Problem definition

The naive approach to handling these datasets is manual: data is processed in Python scripts or notebooks and passed back and forth as CSV/JSON files on file storage.
This approach causes problems such as:
- No single source of truth for dataset state
- Large amounts of repetitive manual work such as downloading files and running scripts
- Brittleness of pipeline to minor format changes

Workflow

Below is a sample of one possible workflow for collecting this type of annotated data. I’ve intentionally designed this workflow to demonstrate as many complexities as possible.

Note that this workflow features cycles. If data doesn’t meet the quality threshold, it gets sent back to the annotation resource for re-work. I’m interested to learn how to handle these cycles in an orchestrator. It’s not lost on me that DAG stands for “directed acyclic graph”, but I’m thinking that this could be handled by re-materializing at the offending step in Dagster, for example.
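
One way to picture the re-work cycle is as a bounded loop around a single step, which keeps the overall graph acyclic. The sketch below is plain Python, not Dagster code. Every name in it (`run_until_accepted`, `toy_annotate`, `toy_score`) is hypothetical, and the `max_rounds` cap is an added assumption so that a batch can’t loop forever.

```python
def run_until_accepted(batch, annotate, score, threshold, max_rounds=3):
    """Send a batch for annotation repeatedly until its QA score passes.

    Returns the accepted batch and the list of scores seen per round.
    Raises RuntimeError if the threshold is not met within max_rounds.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        batch = annotate(batch, round_number)  # (re-)work by the annotation resource
        quality = score(batch)                 # e.g. an IAA metric
        history.append(quality)
        if quality >= threshold:
            return batch, history
    raise RuntimeError(f"Batch still below threshold after {max_rounds} rounds: {history}")


# Toy stand-ins: each round of re-work fixes two more rows out of ten.
def toy_annotate(batch, round_number):
    return {"agreeing_rows": batch["agreeing_rows"] + 2, "total": batch["total"]}


def toy_score(batch):
    return batch["agreeing_rows"] / batch["total"]


accepted, scores = run_until_accepted(
    {"agreeing_rows": 5, "total": 10}, toy_annotate, toy_score, threshold=0.8
)
print(scores)  # [0.7, 0.9]
```

In an orchestrator, the body of the loop would correspond to re-running the offending step rather than to an edge that points backwards in the graph.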

My work so far

I’ve implemented the first four steps of the above flowchart in Dagster, a data orchestration platform that was recommended to me by several Recursers.
I’m using DuckDB for the database. For someone who usually works in Python/dataframes, this has been a natural way to get accustomed to writing SQL queries. I’m guessing I’d need to use a cloud database service in a real production scenario, but I don’t have a good sense of how to pick one.
I’m using a subset of the Europarl dataset. I chose this dataset because it’s multilingual (my favorite), and features a lot of different language pairs, so I can enjoy trying to decipher European languages I don’t know while glancing at the data.

What’s next

I’m going to continue to implement steps from the pipeline in Dagster, but I’m not sure if I’m going to have time to finish all of them by the end of my time at the Recurse Center. At some point, I need to stop and think about how to package this up into a coherent portfolio project. Your input is also welcome!

Day Eighteen - Build your own shell in Rust

2024-01-29T00:00:00+00:00

I’ve been doing the Build Your Own Shell challenge on Codecrafters to get more proficient in Rust.

I’ve just finished the “base stage”, and I have a minimally working shell in which I can use the following commands:

echo
type
exit 0
Look for arbitrary executables in PATH, run them, and capture their output

Here’s my code so far: Build your own shell - base stages

At risk of being cheesy, part of building my own shell is breaking out of my shell, so I’m setting an intention for myself to:

Ask for a code review of this from my peers
Pair with someone on a new step of the project

Most of the problems I’ve worked on so far are of “medium” difficulty, which feels about right for my skill level. The problem descriptions are underspecified and require some research, both into Rust language features and how the shell works. Sometimes I feel that they should be more explicit about what the expected results of the tests are. Overall, I enjoy the structured format of the learning, the hints left by other users, and the immediate test results after git push.

Things I’ve learned so far

.collect() to put results of an iterator into a vector
All arms of a match have to return the same type
std::process::Command to send commands to the system (like subprocess)
pathsearch::find_executable_in_path (nice self-explanatory name)
I’ve gotten more comfortable matching with Some or None.
If a match arm’s pattern is a bare variable name like x, it binds and matches any value, so a following _ (all other cases) arm is unreachable and the compiler warns about it (oops).
You need to do this to convert stdout to a string because it is in bytes:
- let stdout = String::from_utf8(output.stdout).unwrap()
- This is the type of thing I was made aware of at some previous stage of life and then immediately forgot.

Lingering questions

I still have a lot of .unwrap() in my code. This is supposed to be bad, and Rust did in fact panic several times while I was debugging, which is a sign that I haven’t handled cases that I should have anticipated. I think the recommended alternative is to match, but… that sounds like a lot of matching. Need to find and get comfortable with alternative solutions.
Are nested matches a good way to handle complex conditions?

Day Fifteen - Halfway point and next goals

2024-01-24T00:00:00+00:00

I can’t believe I’m already at the halfway point of the six-week batch I committed to.

I spent a lot of the first half meeting people, exploring topics, and “learning how to Recurse”. I’m glad that I’ve settled on two major projects for the second half. It’s surprising how things have fallen into place after I was feeling so despairing only a couple days ago.

Goals for weeks 4-6

Have the demo of my data pipeline project finished. This doesn’t need to contain all the features you’d need for managing annotated ML datasets in production, but it should:
- Store database in some kind of managed cloud solution
- Have a rudimentary “orchestrator” that moves a dataset from one step of the process to the next
- (Stretch goal) Have a basic web GUI?
Finish the Codecrafters “Build Your Own Shell” challenge in Rust. I only started this yesterday, and it’s so much fun. I’m learning a few things about the shell, while also getting more comfortable with Rust syntax and building a tangible project.
Do a lot more pair programming to achieve these goals! I’ve been shy about “driving” during pairing so far, but now that I have two meaty projects to share, I feel more ready.
(Stretch goal) Give one programming talk and one non-programming talk (topics: TBD).

All these goals feel ambitious to me, and I’m not sure I’ll have them done in three more weeks. I need to start the job hunt after that, so I’ll continue working on these projects and interacting with the RC community on the side.

Day Thirteen - Persistence

2024-01-22T00:00:00+00:00

Yesterday I settled on the project I’m going to focus on for the rest of this batch. I will build a tool that solves issues that commonly occur when handling annotated data for AI/ML use cases. I’m excited that I’ve decided to work on this. The project feels practical and right. I don’t know if anyone will ever use the tool I develop, but I will have the satisfaction of solving a problem for past me, and it should be a good way to get familiar with a lot of different data-related technologies quickly.

Today I was feeling down about my lack of progress. It feels like I’m still in exploring/planning mode, and I’m anxious about the fact that I’m approaching the end of week three out of a planned six. On the other hand, I’ve learned in life that there are certain processes that you can’t force to go at an arbitrary pace.

I’m glad I already have so many life experiences of situations when something took longer or more tries than expected. I was only accepted to grad school on my third try. Then it took me three years to finish my MS, and the process often felt unsustainable alongside parenting and full-time work. But in the end it was immensely satisfying, and here I am, barely six months later and embarking on a new learning journey, so I must be doing something right.

One thing that’s held me back this week is my unwillingness to ask questions. This is a long-ingrained habit that stems from the fear of appearing vulnerable. It’s funny how these lessons recur periodically in life - I’ve had to battle this one many times. But “learning generously” by asking questions, sharing your uncertainties, etc. is one of the self-directives at RC, and it’s one that’s impossible to practice by just grinding harder. In fact, you have to soften yourself for this, which I find agonizing. I will try to take this self-directive to heart in the coming days and weeks.

Day Eleven - Aha moment

2024-01-20T00:00:00+00:00

Last Friday I chatted with Michael, who was kind enough to tell me all about his experience in data engineering. He practically provided me with a custom curriculum for this area.

One of the technologies he recommended I look into was polars, a new dataframe library that promises to replace pandas. As I was browsing its website, the phrase “Built in Rust” caught my eye. It was one of those moments where you feel things starting to fall into place. Finally, I saw my disparate interests converging.

Side note: I haven’t even tried polars out, but I was excited to see this in the docs:

df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)

I can’t express how much better this is than axis=0. It seems like a small change, but naming matters.

Day Nine - Rock-paper-scissors in Rust

2024-01-17T00:00:00+00:00

Today I paired with Farid and Marc on rock-paper-scissors in Rust, which was extremely fun and satisfying! I should mention this is the first Rust code I’ve written from scratch.

Here’s our code, if you’re interested.

Things I learned in pairing

Named loops! I recall reading about this in the Rust book, but Marc pointed out a handy situation to use them.
Remember that loops also have scope.
Initializing an empty String, then using push_str() to push a string slice onto it. This is one way of dealing with the String vs. string slice issue.
Also using as_str() to deal with this issue, though I don’t feel clear on which contexts this is needed in. This also belongs on my list of things to learn more about below.
Using the pattern matching syntax to deal with conditions, which feels really concise and powerful!

Things I’m still struggling with

References and dereferencing in general
.expect() and .unwrap()
- I’m doing some reading about the Result and Option types
I just need more finger exercise in this language

Please correct me if anything above seems not quite right!

Day Nine - Fits and starts

2024-01-16T00:00:00+00:00

Ballet sequence generator

I made great progress on my ballet sequence generator today! To recap, this is a program that generates the choreography for a ballet lesson. The format of a ballet lesson is highly structured, but there is variation in the individual step sequences that make up each subsection of the class.

My program successfully:

Generates valid sequences according to rules
Ensures that sequences match the length of the music

I implemented the grammar using nltk’s CFG (context-free grammar) class. This has convenient built-in functionality for reading a grammar from a file, generating and parsing sequences, etc. Context-free grammars are typically used to parse and generate natural language (for example, English), but there’s no reason they can’t be used for other types of rule-based systems that generate sequences.

Yesterday I was using a PCFG (probabilistic context-free grammar), which is the same as a regular CFG except that it has a probability associated with each rule. I abandoned this because I decided that assigning probabilities to the various outcomes of each rule wasn’t important. I just want to generate a fun ballet lesson!

Today my goal was to make the generated sequence match a given length of music. I experimented with ideas like further extending the CFG class to give each leaf node (each ballet step) a “length” attribute, or building my own node class and then using a depth-first search strategy to only return sequences with the desired length. Finally I realized that I could continue using the CFG class and simply store the lengths of the steps separately, generate all the possible sequences, and reject sequences with an invalid length. Lazy but effective (for now).
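
The generate-then-reject idea can be sketched without any library at all. The example below is a simplified stand-in for the nltk-based version, not the actual project code: the grammar, the step names, and the count values are invented for illustration, and the tiny `expand` helper only works for grammars without recursion.

```python
from itertools import product

# Hypothetical mini-grammar: each non-terminal maps to its possible productions.
GRAMMAR = {
    "PLIES": [["DEMI", "DEMI", "GRAND"], ["DEMI", "GRAND", "PORT"]],
    "DEMI": [["demi_plie"]],
    "GRAND": [["grand_plie"]],
    "PORT": [["port_de_bras"], ["cambre"]],
}

# Step lengths are stored separately from the grammar (values are made up).
STEP_COUNTS = {"demi_plie": 2, "grand_plie": 4, "port_de_bras": 4, "cambre": 8}


def expand(symbol):
    """Yield every sequence of steps derivable from symbol (non-recursive grammars only)."""
    if symbol not in GRAMMAR:
        # Anything without a rule is a terminal, i.e. an actual ballet step.
        yield [symbol]
        return
    for production in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(list(expand(s)) for s in production)):
            yield [step for part in parts for step in part]


def sequences_with_length(start, target_counts):
    """Generate all sequences, then keep only those that fill the music exactly."""
    return [
        sequence
        for sequence in expand(start)
        if sum(STEP_COUNTS[step] for step in sequence) == target_counts
    ]


print(sequences_with_length("PLIES", 8))
# [['demi_plie', 'demi_plie', 'grand_plie']]
```

Enumerating everything first only stays practical while the grammar is small, which is why this counts as the lazy option.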

There’s so much more I could do with this project, but I think I need to put it aside for a few days. The next step will be to try actually matching this up to music! I have never worked with audio before, so this should be fun.

Rust progress

Not so great was the progress on learning Rust. I had fun watching the team write tic-tac-toe at our Rust study group today, but was a little intimidated because I wouldn’t even have the first idea of how to go about many of these steps.

I’m trying to draw on my experience of entering new areas of study, which tells me that I just have to push through these feelings of fear and keep working and experimenting.

During my searching I came across a nice forum reply to a post asking why Rust code is so verbose: “One way to look at it is that Rust doesn’t focus on small programs with simple problems. It handles problems that arise in bigger, more complex programs.” I thought this was helpful context to keep in mind.

Day Seven - Ballet sequence generator

2024-01-14T00:00:00+00:00

Remote RC was somewhat better today, though I also forgot that I had a dentist appointment.

Today I did some work on my ballet sequence generator! This is a fun side project that I’m doing in Python to take a break from “serious” work.

A ballet class follows a predictable set of rules but has some variation. I’m trying to build a program that will generate valid ballet class sequences.

This could help ballet teachers brainstorm lessons when they’re suffering from choreographer’s block. It could also be a way for ballet students to practice dancing at home.

What I did today:

Write PCFG (probabilistic context-free grammar) in a text file
- Only for “plies” (the very first part of class) for now
Generate valid sequences from grammar using this extension to NLTK’s PCFG

What’s next:

Need to make the generated sequence match the length of the music.
- For now, assume that all music is in 4/4.
- Pass in the length of the music (“count”) as an int.
- Each leaf node (“terminal” in CFG terminology) has a count associated with it (or multiple possibilities, if a move can be performed over 2 or 4 counts, for example).
- The combined total of the counts at all the leaf nodes needs to match the length of the music.
  - This doesn’t need to scale to arbitrary lengths of music. It’s okay to return an error on failure.
- It should also be able to repeat parts of a sequence to “fill out” an extra-long count.
  - This is starting to sound like regex or something else - there’s no such thing as a repeater operator in the nltk cfg implementation. This makes me think that I need to further adapt the PCFG class, implement a version myself from scratch, or try a totally different kind of data structure.
Print the result out in a more readable format

Future ideas (that I may not get to):
Figure out how to extract count from actual audio file
- Remember to handle the fact that many of these ballet tracks have an “intro” that shouldn’t be counted
Display text cues along with playing music
Display some kind of graphics (maybe an image that fades in/out along with the music)
Play audio cues

Day Six - Starting remote work and more notes on Rust

2024-01-13T00:00:00+00:00

Okay, so I missed blogging Day Five. Suffice it to say that I thoroughly took advantage of my last day socializing in the physical Hub.

Fast-forward to today (Monday) - I’ve flown back to Seattle, reintegrated with family life, and now I’m starting to work remotely. This was a harder landing than I expected. I had forgotten how tiring it is to work from home.

Possible causes:

Lack of the fun people energy that I experienced at the RC hub.
Parenting takes a physical toll.
Constant visual reminders of household/admin tasks that haven’t gotten done are wigging me out.
I’m in my home office, where I’m accustomed to doing my paid employment, and I’m repeating unhelpful behavioral patterns from work.
The weather in Seattle is just plain demoralizing.

To do: think more closely about which of these factors I can change.

More notes on Rust:

It’s funny to read the Rust book and other sources without knowing any other lower-level programming languages, because everything is written from the perspective of solving the problems of C++ programmers. It just paints such a vivid picture of their anguish in C++ and (one assumes) relief at working in Rust.

The borrow checker seems like a lot of work to supposedly make my life easier. This is going to take a while to get used to.
Is one of the reasons Python is so slow that it’s constantly having to allocate and deallocate memory for lists / dynamic arrays?
I like the “permissions” terminology because it reminds me of my existing mental model of permissions in a filesystem.
Wait, don’t introduce a new permission (flow), I had just gotten used to the first three!
Liked the insight that I’ve also worked with memory/references in Python, it’s just mostly abstracted away. I seem to remember occasional issues with .copy() that could be solved by using copy.deepcopy(), but I rarely had occasion to use these in the first place.
In Rustlings, it’s extra satisfying that rust-analyzer also gets less angry in the VSCode file explorer panel the more exercises you complete!
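
On the Python copying point in the notes above, here is a small sketch of how a shallow copy can still share nested data with the original, and how `copy.deepcopy` avoids that. The `row` dictionary is a made-up example.

```python
import copy

# A made-up row with a nested, mutable list of labels.
row = {"text": "hello", "labels": ["greeting"]}

shallow = copy.copy(row)   # new outer dict, but the inner list is shared
deep = copy.deepcopy(row)  # new outer dict and a new inner list

# Mutate the nested list through the original.
row["labels"].append("informal")

print(shallow["labels"])  # ['greeting', 'informal']
print(deep["labels"])     # ['greeting']
```

The shallow copy sees the change because both dictionaries hold a reference to the same list, which is the kind of sharing that Rust makes explicit.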

Day Four - Cataloguing some feelings

2024-01-09T00:00:00+00:00

I felt guilty for spending almost no time at the computer today before realizing that it was my second-to-last day at the Hub and that socializing was probably the right thing to do.

I also felt nervous because it seemed like people around me picked things up very quickly. I have to remind myself that (a) everyone has a different level of experience, and (b) I’m a slow-to-warm-up type and often feel very hesitant about things until I get comfortable. Then I do fine.

I’ve also been deliriously happy. It’s like a little slice of paradise here. It’s rare in life to be in a place where everyone is there because they chose to be. In fact, I can’t think of anything like this. Grad school could (should?) have been like this, but many of my fellow students were facing various pressures, and maybe the top-down nature of the curriculum also made a difference.

Finally, I’ve enjoyed how eager people are to answer questions. On StackOverflow, people will scold you if you ask a question that’s already been answered. For another example, I was taken aback when I asked the hotel receptionist if she could recommend a local laundry, and she told me to Google it. But at the Recurse Center, people will line up to answer your beginner questions. This makes me wonder how such a learning-oriented environment could be cultivated outside of RC.