Parsing MIDI messages in Rust
Monday, December 9, 2024
I'm working on a terrible idea of a project, and this project uses MIDI. That means I need a MIDI implementation! I chose to use an existing library, midir, to connect to devices and receive messages. But the reason I was interested in this not-yet-announced project is that I wanted to understand MIDI. So it was time to implement the communication protocol myself.
What is MIDI and why's it cool?
MIDI stands for Musical Instrument Digital Interface, and it really doesn't bury the lede. It's a standard for how digital musical instruments communicate! The standard covers both the electronics and hardware and the communication protocol. This post is only concerned with the communication protocol.
In 1980, there was no standard digital interface for how instruments communicate[1]. You had electronic instruments, and it would be nice to put them together, but manufacturers did their own things. Roland created a couple of protocols for their own devices, then pulled in some folks from other companies to make a standard. This eventually became MIDI, and the first MIDI device was released in 1983.
Since then, some new functionality has been added, but the core has remained the same. There are a few new(er) standards that are in various stages of use. For example, Open Sound Control (OSC) is used in some instruments and applications, and was created in the early 2000s. And MIDI 2.0 has been announced but doesn't have widespread adoption.
MIDI, the original, is still in widespread use because it works and it's ubiquitous. Let's marvel at that a little bit: this protocol has lasted over 40 years, and it has a successor which isn't widely implemented. It certainly has some flaws and quirks, and it has limitations, and that's why we'll eventually see it replaced or surpassed. But its longevity is simply incredible.
When you connect devices with MIDI, each can send and receive messages. These messages let you do things like turn on and off a specific note (play A4 at this volume), bend the pitch, and change parameters of the synthesizer. It also lets you do things like synchronize timing, select a song to play, or perform manufacturer-specific commands.
A common MIDI use case is connecting a controller (a keyboard, wind controller, or other instrument with controls you can activate) to a synthesizer (hardware or software). Not every electronic instrument can make sound on its own, and this lets you decouple those pieces! You can also connect multiple devices together, so you can have one sequencer that's recording MIDI events and replaying them, and then have other controllers feed into that, and have the whole system output to a synthesizer (or use one that's onboard). There's a lot of flexibility and you can do some really neat things with MIDI, especially since you can edit the actual note on/off events and their timing and play around with those!
So... all I want with MIDI is to make my computer listen to some MIDI messages and do something based on them. Let's look at how the protocol works.
The basic workings
The fundamental unit of the MIDI protocol is a message. Each message has one status byte, followed by some number of data bytes. A status byte starts with a leading 1, and data bytes have a leading 0 (so they are effectively 7-bit bytes)[2].
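Here's a minimal sketch of that leading-bit check in plain Rust. The function names are chosen for illustration, and masking with 0x80 is just one way to test the top bit.

```rust
/// Returns true when the top bit is set, which marks a status byte.
pub fn is_status_byte(byte: u8) -> bool {
    byte & 0x80 != 0
}

/// Returns true when the top bit is clear, leaving 7 bits of payload.
pub fn is_data_byte(byte: u8) -> bool {
    byte & 0x80 == 0
}
```

Every byte is exactly one of the two, so a receiver can always tell where a new message starts.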
And each message falls into a particular group. The main groups of messages are voice messages, system common messages, and system real-time messages[3]. Voice messages tell you about playing sounds: note on/off, key pressure, pitch bends, etc. System common messages let you do manufacturer-specific things and control positioning in songs/sequences. And system real-time messages are for timing, mostly.
The overall structure of a MIDI message is a status byte followed by some data bytes. For example, a message with two data bytes is laid out like this: [status byte: 1xxxxxxx] [data byte 1: 0xxxxxxx] [data byte 2: 0xxxxxxx].
What can the status be, and what do the data bytes represent? It depends on the kind of message.
Voice messages
There are seven voice messages: Note Off, Note On, Aftertouch, Control Change, Program Change, Channel Pressure, and Pitch Wheel.
We might get a message with the value 0x904851 (three bytes, in hex[4]).
To parse this, we deal with the status byte first. For voice messages, we split this into two pieces: the first nibble (four bits) is the category of message, and the second nibble is the channel (values 0-15, representing channels 1-16).
So when we look at the nibbles for our message, we see that the category is 0x9 and the channel is 0x0. 0x9 denotes a Note On event, which for a keyboard is sent when a key is pressed down.
Given it's Note On, we expect two data bytes to follow. The first, 0x48, is the note number. The second, 0x51, is the velocity. So our note is 72 with a velocity of 81! This corresponds to a C5 (taking middle C, note 60, as C4; some manufacturers number octaves differently) at roughly 64% of the available volume (81 out of 127).
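To make the arithmetic concrete, here's a small sketch that pulls the same message apart by hand. The helper names `split_status` and `velocity_fraction` are made up for this example and aren't part of the parser built later.

```rust
/// Splits a voice status byte into (category nibble, channel).
pub fn split_status(status: u8) -> (u8, u8) {
    (status >> 4, status & 0x0f)
}

/// Converts a 7-bit velocity into a fraction of the maximum value (127).
pub fn velocity_fraction(velocity: u8) -> f32 {
    f32::from(velocity) / 127.0
}
```

For the bytes 0x90, 0x48, 0x51, `split_status(0x90)` gives category 0x9 on channel 0, the note number 0x48 is 72 in decimal, and `velocity_fraction(0x51)` is 81 / 127.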
The rest of the voice messages are parsed in the same way, with 1 or 2 data bytes. Program Change and Channel Pressure have one data byte, and the rest have two data bytes. The data bytes have different meanings based on the category, but most of these are either a single 7-bit value or a pair of 7-bit values. The exception is Pitch Wheel, which is a 14-bit value that you reconstruct from the two 7-bit halves.
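As a sketch of that reconstruction: the MIDI spec sends the least significant 7 bits in the first data byte and the most significant 7 bits in the second. The function name `combine_14_bit` is hypothetical, and the extra masking is only there to keep stray top bits out of the result.

```rust
/// Combines two 7-bit data bytes into one 14-bit value.
/// `lsb` is the first data byte on the wire, `msb` the second.
pub fn combine_14_bit(lsb: u8, msb: u8) -> u16 {
    (u16::from(msb & 0x7f) << 7) | u16::from(lsb & 0x7f)
}
```

With this layout, a centered pitch wheel (data bytes 0x00 then 0x40) comes out as 8192, the midpoint of the 0 to 16383 range.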
A fun part of the spec here: If you send a Note On message with velocity of 0, it must be interpreted as a Note Off message (which itself also has a velocity for the release speed). The keyboard I have functions like this, only sending Note On, while my wind synth sends both types. This is all valid and compliant with the spec.
System common messages
These are much like voice messages, but the status byte is used entirely for what kind of message it is. These are general to all connected MIDI devices, so we don't specify the channel.
There are five system common messages: System Exclusive, MIDI Time Code, Song Position Pointer, Song Select, and Tune Request.
Tune Request has no data bytes, so it's just the one status byte. Song Position Pointers are parsed like Pitch Wheel, where we have two data bytes which form one 14-bit value. Song Select is parsed as a single byte value which specifies the song number. And MIDI Time Code is also a single byte, but it's parsed as two nibbles for a message type and value.
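Here's a sketch of that nibble split for the MIDI Time Code data byte. Since data bytes always have a leading 0, the upper nibble only carries three useful bits. The name `split_time_code` is an assumption for illustration.

```rust
/// Splits a MIDI Time Code data byte into (message type, value).
/// The layout is 0tttvvvv: three bits of type, four bits of value.
pub fn split_time_code(data: u8) -> (u8, u8) {
    ((data >> 4) & 0x07, data & 0x0f)
}
```

For example, a data byte of 0x23 is message type 2 with value 3.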
The most interesting one is System Exclusive (SysEx). This basically gives us arbitrary per-manufacturer message types, and anyone else is supposed to close their ears and ignore that one if it's not for them. Some of these are things like bulk data dumps or listing patch parameters.
SysEx messages start with 1 or 3 bytes for the manufacturer ID, and then the rest is data. These messages are arbitrary length, and are terminated by finding the byte 0xF7.
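A sketch of how the ID length can be told apart: a first byte of 0x00 signals a three-byte ID, and anything else is a one-byte ID. The function `split_manufacturer_id` is hypothetical and assumes the input is only the bytes between the 0xF0 and 0xF7 markers.

```rust
/// Splits SysEx contents into (manufacturer ID, payload).
/// Returns None if there aren't enough bytes to hold the ID.
pub fn split_manufacturer_id(contents: &[u8]) -> Option<(&[u8], &[u8])> {
    // A leading 0x00 means the ID is extended to three bytes.
    let id_len = match *contents.first()? {
        0x00 => 3,
        _ => 1,
    };
    if contents.len() < id_len {
        return None;
    }
    Some(contents.split_at(id_len))
}
```

The parser later in this post keeps the ID and payload together in one Vec, so a split like this would happen after parsing.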
System real-time messages
The final category of messages is system real-time messages. To me, these seem both simple and utterly cursed.
The simple part: there are seven messages, each of which is only one byte. You have Clock, Tick, Start, Stop, Continue, Active Sensing, and Reset.
Let's use Clock as an example. It's sent 24 times per quarter note, and the entire message is its status byte: 0xF8.
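The 24-per-quarter-note rate ties Clock messages directly to tempo. Here's a sketch of that arithmetic, assuming tempo is given in quarter notes per minute; the constant and function names are made up for this example.

```rust
use std::time::Duration;

/// MIDI Clock messages per quarter note.
pub const CLOCKS_PER_QUARTER_NOTE: f64 = 24.0;

/// Time between consecutive Clock messages at a given tempo.
pub fn clock_interval(bpm: f64) -> Duration {
    Duration::from_secs_f64(60.0 / (bpm * CLOCKS_PER_QUARTER_NOTE))
}

/// Estimates tempo from the measured gap between two Clock messages.
pub fn bpm_from_interval(interval: Duration) -> f64 {
    60.0 / (interval.as_secs_f64() * CLOCKS_PER_QUARTER_NOTE)
}
```

At 120 quarter notes per minute, that works out to one Clock message roughly every 20.8 milliseconds.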
Okay, what's wrong with that?
Nothing, the message itself is fine. It's just where you can put it.
What seems utterly cursed is that you can put these anywhere and it's valid. In between bytes of other messages, sure! So the Note On message we had could be received instead as 0x9048F851.
Like... I get it. This means we can send these messages at exact times so that timing is locked in. But the rest of the messages, except SysEx, are at most 3 bytes total. It seems a little unnecessary to do this! And it makes parsing more complicated, because you have to check whether each byte is a system real-time message, instead of knowing that the next couple of bytes are definitely for this message.
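One way to cope with interleaving, sketched here under the assumption that you have a raw byte stream in hand, is to pull the real-time bytes out before parsing anything else. All seven real-time status bytes live in the range 0xF8 to 0xFF, so a single comparison is enough. The function name `split_realtime` is hypothetical.

```rust
/// Separates system real-time bytes (0xF8..=0xFF) from everything else,
/// preserving the order within each group.
/// Returns (real-time bytes, remaining bytes).
pub fn split_realtime(stream: &[u8]) -> (Vec<u8>, Vec<u8>) {
    stream.iter().copied().partition(|&b| b >= 0xf8)
}
```

Run on the interleaved example 0x90, 0x48, 0xF8, 0x51, this yields the Clock byte on one side and the intact Note On message on the other. It does lose the exact position of the real-time byte relative to the others, which is fine for parsing but not for precise timing.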
A parser combinator appears
Now that we've looked at how the protocol works, let's parse it! There are a variety of ways to write parsers in Rust. I chose to use a parser combinator since it's a relatively simple approach here, and it lets us write a lot of reusable code.
Structuring our data
First we need the structures we're parsing into.
We can define enums for each of the three message groups.
There's an extra variant in each of these, Unknown, to provide a fallback if we run into one of the reserved status bytes.
#[derive(PartialEq, Eq, Debug, Clone)]
pub enum VoiceCategory {
    NoteOff { note: u8, velocity: u8 },
    NoteOn { note: u8, velocity: u8 },
    AfterTouch { note: u8, pressure: u8 },
    ControlChange { controller: u8, value: u8 },
    ProgramChange { value: u8 },
    ChannelPressure { pressure: u8 },
    PitchWheel { value: u16 },
    Unknown,
}

#[derive(PartialEq, Eq, Debug, Clone)]
pub enum SystemCommon {
    SystemExclusive { data: Vec<u8> },
    MidiTimeCode { time_code: u8 },
    SongPositionPointer { value: u16 },
    SongSelect { song_number: u8 },
    TuneRequest,
    Unknown,
}

#[derive(PartialEq, Eq, Debug, Clone)]
pub enum SystemRealtime {
    Clock,
    Tick,
    Start,
    Stop,
    Continue,
    ActiveSense,
    Reset,
    Unknown,
}
Then we define a struct for the VoiceMessage, since we also want the channel information.
#[derive(PartialEq, Eq, Debug, Clone)]
pub struct VoiceMessage {
    pub category: VoiceCategory,
    pub channel: u8,
}

impl VoiceMessage {
    pub fn new(category: VoiceCategory, channel: u8) -> VoiceMessage {
        VoiceMessage { category, channel }
    }
}
And we make a high-level enum to contain each of the message groups. This approach lets us treat entire groups of messages in the same way, rather than having to match on each individual message type. You could certainly make one big enum for all of them, though!
#[derive(PartialEq, Eq, Debug, Clone)]
pub enum Message {
    Voice(VoiceMessage),
    System(SystemCommon),
    Realtime(SystemRealtime),
}
Building our parser
Parser combinators are neat because they let you combine small, discrete pieces into a larger whole. You define small parsers, then build your parser by combining these together in various ways!
Top-level parser
We'll start at the high level.
Since we have three different message types, we'll define three parsers, one for each message type.
So our top-level parser will return the main Message enum from above, and it will call each of our individual parsers in turn.
Which parser to call is determined by the range the status byte is in.
// Imports for all the parsers in this post (these paths assume nom 7).
use nom::bytes::complete::{tag, take, take_till};
use nom::IResult;

pub fn parse_message(bytes: &[u8]) -> IResult<&[u8], Message> {
    let (bytes, status_byte) = take(1usize)(bytes)?;
    let status_byte = status_byte[0];

    // TODO: implement running status; see [1].
    // [1]: http://midi.teragonaudio.com/tech/midispec/run.htm

    if status_byte < 0xf0 {
        let (bytes, vm) = parse_voice_message(status_byte, bytes)?;
        Ok((bytes, Message::Voice(vm)))
    } else if status_byte < 0xf8 {
        let (bytes, sc) = parse_system_common(status_byte, bytes)?;
        Ok((bytes, Message::System(sc)))
    } else {
        let sr = parse_system_realtime(status_byte);
        Ok((bytes, Message::Realtime(sr)))
    }
}
A few things to note in parse_message:
- take is a parser that's defined by nom for us, and calling it defines a new parser that takes the specified number of bytes. When you invoke this parser on bytes, the return value is a tuple of the remaining bytes after parsing, along with a slice of the taken bytes.
- There's a TODO in here, because you can implement running status, which lets the status byte be omitted if nothing else has happened in the meantime. I chose to ignore this for now, until some piece of hardware requires I implement it.
- The library I'm using for MIDI device discovery also chunks messages for me, so I don't think I can run into the interleaved messages situation in this code, which is why I have a separate parser for it and am ignoring that here.
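For the curious, here's a rough sketch of what running status could look like, written in plain Rust without nom so it stands alone. It is not the approach this post's parser takes. The `RunningStatus` type and its `resolve` method are hypothetical, and the rules encoded here (voice messages set running status, system common messages clear it, system real-time messages leave it alone) follow my reading of the MIDI spec.

```rust
/// Remembers the most recent voice status byte so it can be reused
/// when a message arrives without one.
#[derive(Default)]
pub struct RunningStatus {
    last: Option<u8>,
}

impl RunningStatus {
    /// Returns the status byte to use for `bytes` plus the data bytes that follow,
    /// or None if `bytes` is empty or there is no status to fall back on.
    pub fn resolve<'a>(&mut self, bytes: &'a [u8]) -> Option<(u8, &'a [u8])> {
        let (&first, rest) = bytes.split_first()?;
        if first & 0x80 == 0 {
            // The message starts with a data byte: reuse the remembered status.
            return self.last.map(|status| (status, bytes));
        }
        match first {
            // Voice messages set running status.
            0x80..=0xef => self.last = Some(first),
            // System common messages clear it.
            0xf0..=0xf7 => self.last = None,
            // System real-time messages leave it untouched.
            _ => {}
        }
        Some((first, rest))
    }
}
```

The resolved status byte and data slice could then be handed to the same per-group parsers that parse_message dispatches to.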
Now let's look at how those child parsers are implemented!
Parsing system real-time messages
System real-time messages are the simplest, since they're just one byte, so we can knock that parser out quickly. It's just a big old match statement. We check the byte that's passed in and we return the appropriate value.
pub fn parse_system_realtime(status_byte: u8) -> SystemRealtime {
    match status_byte {
        0xf8 => SystemRealtime::Clock,
        0xf9 => SystemRealtime::Tick,
        0xfa => SystemRealtime::Start,
        0xfb => SystemRealtime::Continue,
        0xfc => SystemRealtime::Stop,
        0xfe => SystemRealtime::ActiveSense,
        0xff => SystemRealtime::Reset,
        _ => SystemRealtime::Unknown,
    }
}
Interlude: helper parsers
Okay, so I know what's coming up for the other two parsers we need. We'll need some helper functions or we'll have a lot of repetition. Let's knock those out here for clarity of exposition.
We'll need to handle forming messages from one data byte or two data bytes, and we'll need to parse 14-bit values.
To handle one- or two-byte messages, we can define a parser which takes in a function.
This function will take in one or two bytes as parameters and should return a message of the type we want.
For example, to form a ProgramChange message, we may pass in |value| VoiceCategory::ProgramChange { value }: a closure which takes in the 8-bit value and constructs just a ProgramChange instance.
This is all just a convenience so we can snag the one or two bytes we need and invoke a constructor with them.
pub fn one_byte_message<T, F>(bytes: &[u8], f: F) -> IResult<&[u8], T>
where
    F: Fn(u8) -> T,
{
    let (bytes, b) = take(1usize)(bytes)?;
    Ok((bytes, f(b[0])))
}

pub fn two_byte_message<T, F>(bytes: &[u8], f: F) -> IResult<&[u8], T>
where
    F: Fn(u8, u8) -> T,
{
    let (bytes, b) = take(2usize)(bytes)?;
    Ok((bytes, f(b[0], b[1])))
}
Parsing a 14-bit value is also pretty straightforward, but uses bit manipulation that may be unfamiliar.
We snag two bytes using nom's take parser. MIDI sends the least significant 7 bits first, so we shift the second byte left by 7 bits and bitwise or it with the first byte.
pub fn take_14_bit_value(bytes: &[u8]) -> IResult<&[u8], u16> {
    let (bytes, db) = take(2usize)(bytes)?;
    // The first data byte holds the low 7 bits; the second holds the high 7 bits.
    let value = ((db[1] as u16) << 7) | db[0] as u16;
    Ok((bytes, value))
}
Okay, now we have our little helpers. Back to business!
Parsing voice messages
As mentioned before, voice messages have two pieces of data in their status byte: the category and the channel. So the first step in parsing them is to extract that.
Let's make our parser function, which will take in the status byte and the remaining bytes, and will return a Result (IResult is a nom-specific variant that already includes the error type for us). To start with, we'll pull out the category and channel from the status byte, and we'll lay out the structure for handling different cases.
pub fn parse_voice_message(status_byte: u8, remainder: &[u8]) -> IResult<&[u8], VoiceMessage> {
    let category_nibble = 0xf0 & status_byte;
    let channel = 0x0f & status_byte;

    let (remainder, category) = match category_nibble {
        // ...
    };

    Ok((remainder, VoiceMessage::new(category, channel)))
}
Now the question is what we do in each of those cases.
It's easy to handle AfterTouch, ControlChange, ProgramChange, and ChannelPressure entirely in terms of the helpers we defined before.
The following match arms are to be added inside the match in the previous code sample:
        0xa0 => two_byte_message(remainder, |note, pressure| {
            VoiceCategory::AfterTouch { note, pressure }
        })?,
        0xb0 => two_byte_message(remainder, |controller, value| {
            VoiceCategory::ControlChange { controller, value }
        })?,
        0xc0 => one_byte_message(remainder, |value| {
            VoiceCategory::ProgramChange { value }
        })?,
        0xd0 => one_byte_message(remainder, |pressure| {
            VoiceCategory::ChannelPressure { pressure }
        })?,
Then we can handle the pitch wheel, which is like these but needs to use the 14-bit parser. This is our parser for it, which we'll call inside the match as well.
pub fn parse_pitch_wheel(bytes: &[u8]) -> IResult<&[u8], VoiceCategory> {
    let (bytes, value) = take_14_bit_value(bytes)?;
    Ok((bytes, VoiceCategory::PitchWheel { value }))
}
And finally we get to parse voice notes!
Since we have the funky behavior of NoteOn, where velocity=0 denotes a NoteOff, we'll use one function for these together.
But we still get to reuse those helpers!
It takes in both the byte slice and a boolean, off, which is used to say whether this is certainly a note-off event or not.
pub fn parse_voice_note(bytes: &[u8], off: bool) -> IResult<&[u8], VoiceCategory> {
    two_byte_message(bytes, |note, velocity| {
        if velocity == 0 || off {
            VoiceCategory::NoteOff { note, velocity }
        } else {
            VoiceCategory::NoteOn { note, velocity }
        }
    })
}
And we add the remaining three cases to our match, along with the default case.
        0x80 => parse_voice_note(remainder, true)?,
        0x90 => parse_voice_note(remainder, false)?,
        0xe0 => parse_pitch_wheel(remainder)?,
        _ => (remainder, VoiceCategory::Unknown),
Put it all together, and we get a parser for voice messages!
Parsing system common messages
The last group of messages is a lot of the same, so I'll start with the full definition here and then dive into the interesting part, SysEx messages.
I'm including one unseen function definition in here, parse_song_position_pointer, since it's the same as the pitch wheel one except it returns a different variant; all the parsing is the same.
fn parse_system_common(status_byte: u8, bytes: &[u8]) -> IResult<&[u8], SystemCommon> {
    match status_byte {
        0xf0 => parse_system_exclusive(bytes),
        0xf1 => one_byte_message(bytes, |time_code| {
            SystemCommon::MidiTimeCode { time_code }
        }),
        0xf2 => parse_song_position_pointer(bytes),
        0xf3 => one_byte_message(bytes, |song_number| {
            SystemCommon::SongSelect { song_number }
        }),
        0xf6 => Ok((bytes, SystemCommon::TuneRequest)),
        _ => Ok((bytes, SystemCommon::Unknown)),
    }
}

pub fn parse_song_position_pointer(bytes: &[u8]) -> IResult<&[u8], SystemCommon> {
    let (remainder, value) = take_14_bit_value(bytes)?;
    Ok((remainder, SystemCommon::SongPositionPointer { value }))
}
Now the only thing we haven't seen is parse_system_exclusive.
Remember that it's a dynamically sized message.
Once we detect the SysEx starting byte (0xf0), we just take and take and take until we find the ending byte, which is 0xf7.
We'll leverage a couple of nom combinators for this:
- take_till accepts a function as an argument and takes bytes until that function returns true
- tag accepts a string or byte array and expects that to be the next value, failing if it isn't present
And then we can shove this all into a Vec and call it a day.
Putting it together, we get this short function.
// Assumed helper: a status byte is any byte with its high bit set.
fn is_status_byte(byte: u8) -> bool {
    byte & 0x80 != 0
}

pub fn parse_system_exclusive(bytes: &[u8]) -> IResult<&[u8], SystemCommon> {
    let (remainder, data) = take_till(is_status_byte)(bytes)?;
    let (remainder, _) = tag([0xf7])(remainder)?;
    let data: Vec<u8> = data.into();
    Ok((remainder, SystemCommon::SystemExclusive { data }))
}
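To see what those two combinators are doing together, here's a nom-free sketch of the same framing logic using only the standard library. It assumes the status-byte check tests the high bit, and `split_sysex` is a made-up name rather than part of the parser above.

```rust
/// Given the bytes after the 0xF0 status byte, returns
/// (SysEx contents, bytes remaining after the 0xF7 terminator).
/// Returns None if a different status byte shows up first
/// or the input ends before the terminator.
pub fn split_sysex(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    // Like take_till: skip data bytes until the first status byte.
    let end = bytes.iter().position(|&b| b & 0x80 != 0)?;
    // Like tag: the byte we stopped on must be the terminator.
    if bytes[end] != 0xf7 {
        return None;
    }
    Some((&bytes[..end], &bytes[end + 1..]))
}
```

Note that, like the nom version, this rejects a SysEx message with a system real-time byte interleaved in the middle, so it relies on the input being chunked into whole messages beforehand.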
And that's it, we've handled the system common messages!
Which means we've handled all the message types.
Can I use it?
This code is for a project I'm working on and I don't have that open source yet (it probably will be eventually!), so no (or not yet). Besides, you probably don't want to: it hasn't been profiled for performance, it's only tested against two of my instruments, and it's likely to have breaking changes soon. There are other libraries out there that are a better choice if you do want to use them, such as midi-msg and midly.
It's also a pretty simple protocol, and it's fun to build your own small parsers!
If you've done anything fun with MIDI or Rust, I'd love to hear about it. Just send me an email (listed below).
Thank you to Robbie for very helpful feedback and corrections, including the note about CV/gate. Any remaining errors are my own.
[1] There was already a method for analog interfacing between equipment called CV/gate. This was around at least as early as 1970, but I can't find a lot of info on this.
[2] This means you can do a bitwise & 0x80 to check if something is a status or data byte.
[3] There are also channel mode messages, which are a variation on a particular kind of voice message. I'm leaving them out here for clarity, but they do technically exist.
[4] I'll represent everything that is hexadecimal with 0x at the start.
If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can email my personal email. To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.