LLMs Don't Hear Music. Here's What They Actually Need.

I spent a few weeks building a system that decomposes music into structured data an LLM can reason about. The goal was to understand what it actually takes to get an LLM to understand audio production, not just classify genre or transcribe lyrics.
The short answer: a lot more than you'd think.
LLMs Are Text Machines
Start with what an LLM fundamentally is. A large language model processes sequential tokens. Text in, text out. Every capability it has, from reasoning to code generation to analysis, operates on symbolic representations: sequences of discrete tokens that map to language.
When you hand a multimodal model an audio file, it doesn't hear music. It runs the signal through a pre-trained encoder that maps audio features to embeddings in the same space as text tokens. Those encoders are trained for specific tasks: speech recognition, sound classification, audio captioning. They can tell you "this is electronic music with a dark mood." They cannot tell you the sub-bass is sitting at C2 with 0.0 rhythmic regularity and a closing filter sweep.
That gap between "dark electronic" and "the drums have 147 pitch glides across 3 octaves with a tail-to-transient ratio of 10.71" is where production understanding actually lives. And nothing in a standard audio encoder captures it.
The Raw Signal Problem
A 90-second audio clip sampled at 22050 Hz is roughly 2 million floating-point numbers between -1 and 1. That's the actual data. A waveform.
You can convert it to a spectrogram, a visual map of frequency energy over time, and feed that image to a vision model. I tried. The model can describe what it sees at a surface level: bright areas, energy distribution, rough structure. It cannot reliably extract that the vocal sits at A3 with a harmonic overtone ratio where the 2x harmonic is at 0.885 strength and the 3x is at 0.787. It doesn't know that the reverb tail-to-transient ratio of 1.22 means the vocal is sitting in a hall space. It sees colored pixels.
The problem is the same one that makes all signal-processing domains hard for LLMs: the raw representation doesn't match how the model processes information. The model needs structured, labeled, parametric data. It needs text.
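The arithmetic behind that "2 million" figure, for concreteness:

```python
# Back-of-envelope size of the raw representation, using the numbers above.
SAMPLE_RATE = 22_050   # Hz
DURATION_S = 90        # seconds

n_samples = SAMPLE_RATE * DURATION_S
print(f"{n_samples:,}")  # 1,984,500 -- the "roughly 2 million" floats
```

Nearly two million unlabeled numbers, with no token structure a language model can attach meaning to.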
Building the Translation Layer
Here's what actually worked. A three-stage decomposition pipeline.
Stage 1: Stem separation. Neural source separation (Demucs) splits the mixed track into isolated stems: vocals, drums, bass, and everything else. A second pass sub-separates drums into kick, snare, toms, hi-hat, ride, and crash. A third pass pulls guitar and piano out of the residual. One input file becomes 13 isolated audio streams.
Stage 2: Per-stem deep analysis. Each stem gets analyzed across 40+ dimensions:
- Pitch: Dominant frequency, note name, range, stability, glide count, voiced percentage. A bass stem with a dominant frequency of 65.8 Hz (C2), 49% voiced, and 62 detected glides tells you the bass is melodic but intermittent with lots of pitch movement.
- Harmonics: Overtone strength ratios at 2x through 6x the fundamental. Three or more strong overtones mean a rich, complex timbre. Zero means pure and simple. This is how you know a guitar sounds crunchy vs. clean without hearing it.
- Envelope: Attack time, quarter-energy distribution, sustain shape, variability. A 17-second attack classifies as a pad. A 0.05-second attack with high variability classifies as percussive. The envelope tells you how the sound breathes.
- Rhythm: Onset count, inter-onset intervals, regularity score. A regularity of 0.0 means completely free-form. 0.7+ means locked to a grid. This single number tells you more about how a stem feels than any adjective could.
- Loudness: LUFS curve over time, the shape of the energy. Building, fading, steady, dynamic.
- Reverb: Tail-to-transient ratio. Below 0.4 is dry. Above 1.5 is cathedral. The number tells you how much space the sound occupies.
- Effects: Distortion (zero-crossing rate + overtone density), compression (crest factor + RMS variability), sidechain pumping (beat-locked amplitude modulation), filter sweeps (spectral centroid movement relative to mean).
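Several of these measurements reduce to a few lines each. A minimal sketch of three, assuming samples normalized to [-1, 1] and onset times in seconds; the formulas are plausible stand-ins for what's described above, not the pipeline's exact implementations:

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ. Used above
    (together with overtone density) as one distortion cue."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def crest_factor(samples):
    """Peak amplitude over RMS. A low crest factor plus low RMS
    variability is the compression cue described above."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return peak / rms

def rhythmic_regularity(onset_times):
    """1.0 = perfectly even inter-onset intervals, 0.0 = free-form.
    The formula (1 minus coefficient of variation, floored at 0) is a
    guess at the scoring, not the article's exact definition."""
    if len(onset_times) < 3:
        return 0.0
    iois = [b - a for a, b in zip(onset_times, onset_times[1:])]
    mean = sum(iois) / len(iois)
    var = sum((x - mean) ** 2 for x in iois) / len(iois)
    return max(0.0, 1.0 - math.sqrt(var) / mean)
```

A metronomic onset list like `[0, 1, 2, 3]` seconds scores 1.0; jittered onsets drift toward 0.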
Stage 3: Structured output. All of it packaged as JSON. A parametric DNA report. For one 90-second track, the output is roughly 500 lines of structured data across every stem.
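An excerpt of what one stem's entry might look like. The bass pitch numbers are the ones quoted earlier; the field names and the remaining values are invented for illustration:

```json
{
  "stem": "bass",
  "pitch": {
    "dominant_hz": 65.8,
    "note": "C2",
    "voiced_pct": 0.49,
    "glide_count": 62
  },
  "rhythm": { "regularity": 0.71 },
  "reverb": { "tail_to_transient_ratio": 0.38, "space": "dry" },
  "effects": { "distortion": false, "sidechain_pumping": true }
}
```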
This is what the LLM can actually work with.
What This Unlocks
Once you have the structured representation, the LLM becomes genuinely useful for audio work. Not as a listener, but as a reasoning engine over extracted features.
Hand it two DNA reports and it can articulate precisely why Track A feels darker than Track B: the bass is 200 Hz lower, the reverb tail is 3x longer, the drums have no rhythmic regularity vs. a locked grid, and the spectral centroid sits 1,400 Hz lower across every stem.
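The comparison itself doesn't even need the LLM to do the subtraction. A sketch of how two reports could be diffed into the text deltas the model reasons over (field names illustrative):

```python
def compare_dna(a, b, path=""):
    """Walk two DNA report dicts and yield human-readable deltas an LLM
    (or a person) can reason over."""
    for key in sorted(set(a) & set(b)):
        here = f"{path}.{key}" if path else key
        va, vb = a[key], b[key]
        if isinstance(va, dict) and isinstance(vb, dict):
            yield from compare_dna(va, vb, here)
        elif isinstance(va, (int, float)) and isinstance(vb, (int, float)) and va != vb:
            yield f"{here}: {va} -> {vb} ({vb - va:+.2f})"

track_a = {"bass": {"dominant_hz": 65.8, "reverb_ratio": 1.9}}
track_b = {"bass": {"dominant_hz": 98.0, "reverb_ratio": 0.6}}
for line in compare_dna(track_a, track_b):
    print(line)
# bass.dominant_hz: 65.8 -> 98.0 (+32.20)
# bass.reverb_ratio: 1.9 -> 0.6 (-1.30)
```

Feed those lines into a prompt and the model's job shifts from perception to interpretation, which is the job it's actually good at.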
It can generate production specifications. Given a DNA report, it can write a prompt describing exactly the sound architecture: tempo, key, per-stem behavior, effects chain, spatial characteristics, dynamic arc. Specific enough that another system or a producer could reconstruct the vibe.
It can identify what's hard to reproduce. The DNA report reveals production decisions that text prompts struggle to express: opposing filter directions across stems, energy drops timed to exact seconds, pitch modulation patterns in percussion that span three octaves. These become documented constraints rather than mysterious failures.
What Actually Took the Most Work
The pipeline itself runs in about 4 minutes per track. Building it took weeks, and most of that time was calibration.
Effects detection thresholds are brutal. My first distortion detector triggered on clean vocals because speech has a naturally higher zero-crossing rate. The filter sweep detector flagged every stem because natural timbral variation in any instrument causes spectral centroid movement. I had to raise the centroid range threshold from 60% above the mean to 100% above before it stopped crying wolf. Each threshold required analyzing real tracks, comparing the detector output against what my ears told me, and adjusting.
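The sweep detector's shape, as I read the description above: compare the spectral centroid's range to its mean, and only flag when the range exceeds the mean by the threshold. The formula is a plausible reconstruction, not the pipeline's exact code; the centroid values are made up:

```python
def detect_filter_sweep(centroids_hz, threshold=1.0):
    """Flag a sweep when the per-frame spectral centroid's range exceeds
    `threshold` times its mean. The fix described above: raising the
    threshold from 0.6 (60% above the mean) to 1.0 (100%) stopped false
    positives from ordinary timbral variation."""
    mean = sum(centroids_hz) / len(centroids_hz)
    span = max(centroids_hz) - min(centroids_hz)
    return span / mean > threshold

# Ordinary timbral wobble: range is 70% of the mean
wobble = [650, 1000, 1350, 1000, 1000]
# A closing sweep: centroid falls from 4000 Hz to 500 Hz
sweep = [4000, 3000, 2000, 1000, 500]
print(detect_filter_sweep(wobble, threshold=0.6),  # True -- the old false positive
      detect_filter_sweep(wobble),                 # False at the 1.0 threshold
      detect_filter_sweep(sweep))                  # True -- a real sweep
```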
Edge cases compound. Reverb estimation fails on stems with no clear transients. Tempo detection gets confused by half-time feels. Whisper gets stuck in repetition loops on short vocal clips. Modulation detection needs a fundamental redesign because amplitude changes and true modulation effects look identical in the signal. Every edge case I fixed revealed two more.
The reference library is the real investment. To know whether a 0.131 zero-crossing rate with 3 strong overtones means distortion or just a bright instrument, you need reference points. I decomposed 16 tracks across different genres and styles, documented what each detector got right and wrong, and used those signatures to calibrate. That library is what makes the system useful vs. just technically interesting.
The Pattern Is Bigger Than Music
This is the part that made the whole exercise worth it for me.
The fundamental pattern, decomposing an opaque signal into structured parameters so an LLM can reason about it, applies to any domain where the raw data isn't text.
Medical imaging: an X-ray is just a 2D array of pixel intensities. An LLM looking at the image can describe anatomy at a surface level. A decomposition layer that extracts bone density measurements, joint angles, soft tissue contrast ratios, and lesion morphology gives the LLM structured features to reason about diagnostically.
Financial time series: price data is a sequence of numbers. Decompose it into regime (trending, mean-reverting, volatile), structural breaks, volume profile, momentum indicators, and event timestamps, and the LLM can reason about market microstructure instead of staring at a chart.
Sensor data, genomics, satellite imagery: same pattern. The LLM is a reasoning engine. It's a very good one. But it reasons over text. Your job is to build the layer that translates your domain's raw signal into the structured representation the model needs.
The LLM part is the easiest piece. The translation layer is the actual engineering.
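Whatever the domain, the translation layer compresses to one shape: run domain-specific extractors over the raw signal, serialize the results as text. A sketch of that interface (all names here are mine, not from any library):

```python
import json
from typing import Any, Callable, Dict

# A translation layer: raw signal in, named parametric features out as text.
Extractor = Callable[[list], Dict[str, Any]]

def translation_layer(signal: list, extractors: Dict[str, Extractor]) -> str:
    """Run every extractor and serialize the result as JSON an LLM can read."""
    features = {name: fn(signal) for name, fn in extractors.items()}
    return json.dumps(features, indent=2)

# Toy extractors for any 1-D signal (audio samples, prices, sensor readings):
extractors = {
    "level": lambda s: {"peak": max(map(abs, s)), "mean": sum(s) / len(s)},
    "trend": lambda s: {"rising": s[-1] > s[0]},
}
print(translation_layer([0.1, 0.4, -0.2, 0.8], extractors))
```

Swapping domains means swapping the extractor dictionary; the interface to the model never changes.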
What I'd Do Differently
I started with the analysis pipeline and built detection for everything I could think of simultaneously. In hindsight, I'd start with 3-4 features per stem (pitch, energy, rhythm, reverb) and validate that the LLM can do useful reasoning with just those before adding effects detection, harmonic analysis, and envelope classification. The 40+ dimensions are impressive, but 80% of the usefulness comes from maybe 10 of them. I spent a lot of time calibrating detectors that turned out to be edge-case-heavy and low-signal.
I'd also build the calibration framework first. Having a structured way to test each detector against labeled reference tracks would have cut the iteration time in half. I did it ad hoc (decompose a track, listen, compare, adjust) when I should have built a test harness from day one.
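The harness I wish I'd built on day one is small. A sketch, with a toy detector and made-up reference entries standing in for real stems and ear-checked labels:

```python
def run_calibration(detector, labeled_refs):
    """Score a detector against a library of labeled reference stems.
    `labeled_refs` maps a track name to (input_signal, expected_label)."""
    results = {name: detector(signal) == expected
               for name, (signal, expected) in labeled_refs.items()}
    accuracy = sum(results.values()) / len(results)
    misses = {name for name, ok in results.items() if not ok}
    return accuracy, misses

# Toy "clipping" detector plus a tiny reference library.
detector = lambda s: max(s) > 0.9
refs = {
    "clean_vox":  ([0.2, 0.5, 0.4], False),
    "hot_guitar": ([0.95, 1.0, 0.97], True),
    "quiet_pad":  ([0.1, 0.15, 0.1], False),
}
acc, misses = run_calibration(detector, refs)
print(acc, misses)  # 1.0 set()
```

Every threshold change then becomes one rerun against the whole library instead of a decompose-listen-compare loop per track.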
The Takeaway
LLMs don't hear. They read. If you want one to understand something outside its native modality, you have to build the translator. For music, that means stem separation, per-stem parametric analysis, and structured output the model can parse. The same pattern works for any signal domain.
The gap between "give it to the AI" and "the AI actually understands it" is a decomposition pipeline. That pipeline is where the real engineering lives. It's also where the real value is, because once you've built it, you've given a reasoning engine access to a domain it previously couldn't touch.