T-SQL Tuesday 191 – String parsing (and why I hate intelligent keys)

Welcome to the delayed 191st T-SQL Tuesday. Our host is, once more, Steve Jones. This time, he asks us to write about our favorite string parsing techniques. Which is an odd thing to ask: nobody ever said “I like doing string manipulation in T-SQL”. So instead of doing exactly what Steve asked, I will provide some background, and then show some tricks that might make the task a bit less unpleasant.

Why intelligent keys are anything but intelligent

I am currently best known for my performance tuning and execution plan work. But when I started working with databases, I actually came from a background of data modelling, database design, and normalization. And that has never fully left me. In fact, I have in the past two years created a whole series of YouTube videos about database design and normalization. And a much longer time ago, I recorded a Pluralsight course on this topic that is still available for viewing.

One of the very basics of schema design for a relational database is to store atomic values in every column. One column, one value, no more, no less. That automatically rules out all repeating groups. So a single column to list all my email addresses? Sorry. You are doing it wrong. And you will pay the price when you try to protect the integrity of your data. Or even just report on it.

But this atomicity rule also rules out composite values in a column. If your application needs to use surname and initials individually, then you must use two columns, one for the surname and another for the initials. That automatically removes the necessity to write awkward string parsing in your code where you try to separate the two.

And that is why I have a thorough hatred of what is commonly known as “intelligent keys”. Granted, I can see why end users like the concept. For humans, identifying a subscription renewal by customer number, magazine code, and year is a lot easier than having to work with a randomly generated 8-digit number. Renewal “10451-PeMa-2025” is instantly recognizable as the 2025 subscription to People Magazine by customer 10451. Renewal “95023954” is just a meaningless number, and requires additional work to look up the customer, magazine, and year.

And just to be clear, I am fully okay with natural keys. In the example above, I would definitely declare a UNIQUE or PRIMARY KEY constraint on the columns for CustomerNumber, MagazineCode, and Year. But as three individual columns. And then I would, depending on the situation, decide whether or not to also add a generated and meaningless surrogate key – that should not be an automatic choice for each table, but decided on a case-by-case basis. But even if I add it, I would still keep the business key, and constrain it to ensure no violations ever happen.

The problems for the database start when people store this human-readable intelligent key as a single column in the database. And the problems then get worse when someone decides to save space in the database by now removing the CustomerID, MagazineCode, and Year columns from the Renewals table, “because they can be inferred from the RenewalKey”.

Dear developers. I get that you never had a course on database design. That is okay, you spent time learning things I cannot do. But would you fly a plane without training, or perform open heart surgery without the proper qualification? If you never learned how to design databases, then don’t. Leave it to the professionals. Or learn the ropes first.

Dear developers. I get that your users like this intelligent key, this single value on their screen that includes all relevant information. But that’s presentation. We can present data in every way you can imagine, and a few you had never thought of. Presentation is done in the presentation layer.

Dear developers. Databases should store atomic values. “10451-PeMa-2025” is not atomic, it is three atomic values smashed together in a single field. Nothing wrong with this in a report, on a screen, or on a shipping label. But everything goes to hell in a handbasket if you store this as such in the database.

Parsing tricks

But of course, you, the reader of this blog, are a smart person. You would never commit such design sins. All your databases are perfectly normalized, with no composite column anywhere to be seen. You just suffer from other people’s issues. Bad designs from other departments. Code smells left by your predecessor. Third party products with a database schema that does not even warrant the name “design”. And now you have to deal with this. Here are some tricks that you might be able to use.

PARSENAME

The PARSENAME function is intended for one very specific purpose: to get the individual parts of a fully qualified object name in SQL Server. However, who says we are not allowed to abuse this function for other purposes?

You will need to work a bit on this. For starters, “10451-PeMa-2025” uses dashes as the separator. PARSENAME expects periods. So we need to replace these, and we need to be aware that we can now get errors if there are periods in any of the components. But if there are not, then a REPLACE will do the job just fine.

The second thing you need to do is to count the number of components. If there are more than four, PARSENAME can’t be used. In this case, there are three. That means that PARSENAME will assume that you specify database, schema, and object name. The object name (year) is last, but requires the value 1 for the object piece. In the same way, the MagazineCode would be object piece 2, and object piece 3 is then the CustomerID.

Finally, you need to be aware that, since PARSENAME thinks it works on SQL Server object names, it will remove [] or “” around the components. And if they are in the components but not fully around them, you might get NULL results. So basically, just don’t use PARSENAME if the intelligent key can contain any of the characters [, ], or “.
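To see this caveat in action, here is a quick sketch (the sample values are mine, invented for illustration); it shows the delimiter stripping and the more-than-four-parts behavior described above:

```sql
-- PARSENAME treats its input as an object name, so it strips the
-- delimiters it recognizes around a component:
SELECT PARSENAME ('10451.[PeMa].2025', 2) AS MagazineCode;  -- PeMa, without the brackets

-- And an input with more than four period-separated parts cannot be
-- a valid object name, so the call returns NULL:
SELECT PARSENAME ('10451.PeMa.2025.Extra.Oops', 1) AS Oops;
```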

However, very often you will find that none of these potential obstacles apply. In that case, you can in fact use code such as the below to find the components of this intelligent key.

SELECT PARSENAME (REPLACE ('10451-PeMa-2025', '-', '.'), 1) AS Year;
SELECT PARSENAME (REPLACE ('10451-PeMa-2025', '-', '.'), 2) AS MagazineCode;
SELECT PARSENAME (REPLACE ('10451-PeMa-2025', '-', '.'), 3) AS CustomerID;

Fixed length

Of course, if we know that the length of each component of the intelligent key is always the same, then we can very easily use the SUBSTRING function. And for the first or last component, we can even shorten this to use LEFT or RIGHT.

SELECT SUBSTRING ('10451-PeMa-2025', 1, 5) AS CustomerID;
SELECT SUBSTRING ('10451-PeMa-2025', 7, 4) AS MagazineCode;
SELECT SUBSTRING ('10451-PeMa-2025', 12, 4) AS Year;

SELECT LEFT('10451-PeMa-2025', 5) AS CustomerID;
SELECT RIGHT('10451-PeMa-2025', 4) AS Year;

Now if you need even more complex stuff, such as for instance creating a new intelligent key where the MagazineCode is removed, but the customer code and year are still there (though I honestly don’t know why anyone would do that!), then people often use the expressions above and concatenate them back together. But that is not needed. A much more intelligent solution is to use the STUFF function, which replaces part of a string (as indicated by a start position and a length) with a new replacement string … which can be empty!

SELECT STUFF ('10451-PeMa-2025', 6, 5, '') AS NewUnintelligentKey;  -- removes '-PeMa', leaving '10451-2025'

Much shorter than using a LEFT, a RIGHT, and then concatenating all of that together!
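For comparison, this is roughly what that verbose alternative looks like (a sketch, assuming the same fixed-length layout as above):

```sql
-- Cut off both ends and glue them back together with a new separator
SELECT LEFT ('10451-PeMa-2025', 5) + '-' + RIGHT ('10451-PeMa-2025', 4) AS NewUnintelligentKey;
```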

Varying length

But what if components can be of varying length? In our example, it is easy to imagine that the CustomerID might be just four positions for some customers, and the MagazineCode might be anything from three to seven characters. The year will of course always be four characters. But let’s for the sake of discussion assume that even that component can vary in length.

If you can use PARSENAME, then this does not matter. But there are restrictions to PARSENAME. So sometimes we need to use LEFT, RIGHT, SUBSTRING, and STUFF. But then, how do we find the correct starting position and/or length parameters to pass into these functions?

This is where the CHARINDEX comes in. But CHARINDEX only returns the first position it finds. If we need the second, then we need a second CHARINDEX, that starts searching after the first, by specifying a start position. But that start position then also requires a CHARINDEX call, and very soon, we get nightmarish constructions.

SELECT CHARINDEX ('-', '10451-PeMa-2025') AS FirstSeparator;
SELECT CHARINDEX ('-', '10451-PeMa-2025', CHARINDEX ('-', '10451-PeMa-2025') + 1) AS SecondSeparator;

Note that this is only to find the position of the first and second separator. To actually find the MagazineCode, we need to use these in a SUBSTRING expression such as this:

SELECT SUBSTRING ('10451-PeMa-2025',
                  CHARINDEX ('-', '10451-PeMa-2025') + 1,
                  CHARINDEX ('-',
                             '10451-PeMa-2025',
                             CHARINDEX ('-', '10451-PeMa-2025') + 1)
                - CHARINDEX ('-', '10451-PeMa-2025') - 1) AS MagazineCode;

And that is only for the second component! Imagine having an intelligent key that has seven components, and you need to extract the sixth!

Sometimes, you can simplify these expressions somewhat by using the REVERSE function. For instance, if you want just the Year, then these two expressions both give you that. I know which one I prefer!

SELECT RIGHT('10451-PeMa-2025', LEN ('10451-PeMa-2025')
                              - CHARINDEX ('-',
                                           '10451-PeMa-2025',
                                           CHARINDEX ('-', '10451-PeMa-2025') + 1)) AS Year;
SELECT RIGHT('10451-PeMa-2025', CHARINDEX ('-', REVERSE ('10451-PeMa-2025')) - 1) AS Year;

Regex?

Even with these tricks, string parsing tends to result in code that is ugly, long, hard to maintain, and error prone. So I guess it is a good thing that SQL Server 2025 will finally implement support for regular expressions.

Now, I really don’t like regex. Those expressions, too, are complex and hard to understand. And it will take me time to learn that syntax.

However, seeing the expressions above, which are even still relatively simple, makes the choice for me very easy. Don’t use intelligent keys, ever! But if that doesn’t work … then I will learn regex, just to avoid this code mess for future projects.
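As a taste of what that could look like, here is a sketch based on the regex functions announced for SQL Server 2025, such as REGEXP_SUBSTR; do check the documentation for the exact signatures and flags before relying on this:

```sql
-- Extract the second dash-separated component (the MagazineCode) in one call:
-- match runs of non-dash characters, starting at position 1, take occurrence 2
SELECT REGEXP_SUBSTR ('10451-PeMa-2025', '[^-]+', 1, 2) AS MagazineCode;
```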


13 Comments

  • I agree with the many issues with parsing intelligent keys. It becomes very painful when a join is done on the parsed value. However, I think it is acceptable to use an intelligent key provided that it is never parsed, which greatly limits the possible use cases. One that I am aware of is using an intelligent key for date (YYYYMMDD) and time (hhmmss) with calendar tables for date and time in a data warehouse. What do you think?

    • Hugo Kornelis
      December 15, 2025 11:03

      Hi Robert!
      Apologies for the delayed reply. WordPress notifications apparently don’t always work.
If an “intelligent” key is never parsed, then how exactly is it intelligent? Then it is just a unique value with no other intrinsic meaning than the identifier of a single row. If you mean it has meaning to the humans using the system, but not in the application, then you are treading a dangerous path. What if one day the requirements change and we do need to start using one of the components of the intelligent key?
For a key column that holds date and time, I would suggest using the datetime2(0) data type – the data type for an atomic “moment in time”. Or, if the date and/or time parts need to be used individually, a composite key of two columns, with data types date and time(0).
In a data warehouse, rules might be somewhat different. I never had formal training on data modelling for data warehouses (Kimball etc). So I don’t know whether it would be okay to store date + time in any data type other than (small)datetime or datetime2.

  • As usual, Hugo, I agree with you. OTOH I did have design training. I might even dream in normal form. 😉

    I am also an advocate of atomic pretty much everything. Learning how to differentiate presentation from data management is something that has to be learned by all developers, database or otherwise.

    And you know I tend to read with a proofreader’s eye. I think the following is missing an important period between “don’t” and “Leave” (paragraph 8):
    “If you never learned how to design databases, then don’t Leave it to the professionals.”

    • Hugo Kornelis
      December 15, 2025 11:06

      Thanks for the reply, Bryant! And my apologies that I never saw this before.
      Dreaming in normal form? That sounds pretty bad. I already suffer from dreaming in execution plans LOL.
      Thanks also for catching that typo. I have added the missing period.

  • Thanks for this. I wish more people understood the value of database design!

    I’ve made use of PARSENAME a couple of times to break down strings, and it has proved to be effective.

    One option you make no mention of is STRING_SPLIT, which with the new ordinal value allows some neat tricks.

select value as Year
from (select value from string_split('10451-PeMa-2025', '-', 1) where ordinal = 3) split

    • Hugo Kornelis
      December 15, 2025 11:13

      Thanks, Toby!
      You are right. I had not thought about STRING_SPLIT. Do note that the version you use is only supported since SQL Server 2022, when the optional third parameter was introduced. On SQL Server 2019 and before, there was no easy way to get the third result from the STRING_SPLIT function.

You have to be careful with STRING_SPLIT when working with large datasets, since it multiplies the number of rows, and there may be a difference between handling 1 million or 3 million rows at once – at least with regard to memory grants etc.

  • How do you feel about JSON- & XML-strings being stored in the database?

    • Hugo Kornelis
      December 15, 2025 11:17

      Hey Sean!
      Well, in the context of this post: please do not ever use an xml or json column as a key column!
In general, there are use cases for storing XML and JSON in a database. But beware. Just because you now have a nice hammer does not mean that every problem is a nail. If you have to do a lot of filtering on these columns, or often have to extract parts of them, then perhaps you get better performance by shredding the data into a relational schema and using that representation instead. If you can’t put it in a relational schema because the data is unstructured or semi-structured, then use a non-relational database that is designed to deal with that kind of data. (And with the challenge of filtering on or returning elements that might not exist, or exist in a different node.)
      If it’s mostly storage, and occasionally a small bit of parsing, then you might be okay. Once you start considering xml indexes or json indexes for performance … well, it’s great that those features exist in case you need them, but it’s also a sign you might be using the wrong type of database.

  • Hello Hugo. Good post. I fully agree. This should be repeated until everyone ever designing a relational database schema knows it and adheres to it.

    p.s. Thanks to Brent, I found your article. Why haven’t I heard from you before? I clearly have some catching up to do.

Food for thought, but date and time columns are composite values.

And there is also hierarchy naming such as “server.database.object” (which is exactly what the PARSENAME function exists to extract).

    • Hugo Kornelis
      December 16, 2025 13:14

      Thanks, Scott.
      This is something I decided not to touch on in this blog. But I discuss it at length in my YouTube video on First Normal Form. When is a value atomic? And the answer is: it depends.
      On what? you now ask. Very smart! The answer now is, it depends on the Universe of Discourse, on how the values will be used.
      If you need to store both date and time, then you need to look at how the application uses this data. Is it just a timestamp for later reference? Is it used to e.g. compute next renewal date by adding a full year every time? In other words, are date and time always used as a single inseparable value to denote a moment in time? Then the moment is the atomic value, and the correct design uses a single column with a datetime or datetime2 data type.
      On the other hand, if you often need just the day (give me all subscriptions that renew tomorrow), or just the time (I need to find subscriptions that expire at exactly midnight on any given date), then you are right that the moment is made up of two separate attributes, date and time, that each carry individual meaning.
      Which of these is correct? That depends on the UoD, the universe of discourse for the company that will use the database. There is no universal truth. There is no single design to rule them all.
That’s why database modellers will still have a job going forward: to fix the mistakes made by an AI that does not know the individual business’s context.

