Implementing a Custom String Tokenizer in JavaScript
Implementing a custom string tokenizer in JavaScript involves creating a function that splits a string into tokens based on specific rules. Tokenizers are often used in compilers, interpreters, and text processing tools. In this episode, we'll create a custom string tokenizer that can handle different delimiters and quotes.
What is a String Tokenizer?
A string tokenizer is a tool that splits a string into a sequence of tokens. Tokens are meaningful elements like words, numbers, or symbols, separated by delimiters such as spaces, commas, or custom characters.
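To make the definition concrete, the short sketch below uses only the built-in String.prototype.split with a regular expression, and then shows where that approach stops being enough. The sample strings are made up for illustration.

```javascript
// Split on one or more spaces, commas, or semicolons.
const simple = 'red, green;blue'.split(/[ ,;]+/);
console.log(simple); // ['red', 'green', 'blue']

// split() has no notion of quotes, so a quoted phrase is cut in two.
const quoted = 'say "hello world"'.split(/[ ,;]+/);
console.log(quoted); // ['say', '"hello', 'world"']
```

The second result is the reason a hand-written tokenizer is needed once quoted strings are involved.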
Real Interview Insights
Interviewers might ask you to:
- Implement a custom string tokenizer.
- Handle various delimiters, including spaces, commas, and custom characters.
- Handle quoted strings where delimiters within quotes are ignored.
- Handle escape sequences within quoted strings.
Implementing a Custom String Tokenizer
Here’s an implementation of a custom string tokenizer:
function customTokenizer(str, delimiters = [' ', ',', ';'], quoteChar = '"') {
  const tokens = [];
  let currentToken = '';
  let inQuotes = false;   // true while between an opening and a closing quote
  let escapeNext = false; // true when the previous character was a backslash

  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (escapeNext) {
      // Keep the escaped character literally, whatever it is
      currentToken += char;
      escapeNext = false;
    } else if (char === '\\') {
      // Drop the backslash itself and escape the next character
      escapeNext = true;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      currentToken += char; // Include the quote character in the token
    } else if (!inQuotes && delimiters.includes(char)) {
      // A delimiter outside quotes ends the current token; empty tokens are skipped
      if (currentToken) {
        tokens.push(currentToken);
        currentToken = '';
      }
    } else {
      currentToken += char;
    }
  }

  // Flush the final token, if any
  if (currentToken) {
    tokens.push(currentToken);
  }
  return tokens;
}
Explanation:
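- Delimiter Handling: Outside quotes, any character in the delimiters array ends the current token.
- Quote Handling: The quote character toggles quoted mode, in which delimiters are treated as ordinary characters. The quotes themselves are kept in the token.
- Escape Sequences: A backslash is dropped and the character after it is added to the token as-is. This applies both inside and outside quotes.
- Edge Cases: Empty tokens are skipped, so consecutive delimiters produce no empty strings in the result.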
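The escape-sequence point can be read two ways: the implementation above treats a backslash as an escape anywhere in the input, while a stricter reading applies it only inside quotes. The following is a minimal sketch of the second reading, not a replacement for the code above; tokenizeQuotedEscapes is a hypothetical name chosen for this example.

```javascript
// Hypothetical variant: a backslash acts as an escape only inside quotes.
function tokenizeQuotedEscapes(str, delimiters = [' ', ',', ';'], quoteChar = '"') {
  const tokens = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (inQuotes && char === '\\' && i + 1 < str.length) {
      // Inside quotes: drop the backslash and keep the next character literally.
      current += str[i + 1];
      i++;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      current += char; // quotes stay in the token, as in the original
    } else if (!inQuotes && delimiters.includes(char)) {
      if (current) {
        tokens.push(current);
        current = '';
      }
    } else {
      // Outside quotes a backslash lands here and is kept as an ordinary character.
      current += char;
    }
  }
  if (current) tokens.push(current);
  return tokens;
}

// In source code, '\\' is one backslash and '\\"' is a backslash followed by a quote.
console.log(tokenizeQuotedEscapes('C:\\temp "say \\"hi\\""'));
// ['C:\\temp', '"say "hi""']
```

With this variant an unquoted Windows-style path keeps its backslash, while an escaped quote inside a quoted section no longer ends that section.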
Practical Examples
Consider examples with various delimiters and quoted strings:
const input1 = 'Hello, "world", how are you?';
console.log(customTokenizer(input1, [',', ' ']));
// Output: ['Hello', '"world"', 'how', 'are', 'you?']
const input2 = 'name="John Doe", age=30, city="New York"';
console.log(customTokenizer(input2, [',', '=', ' ']));
// Output: ['name', '"John Doe"', 'age', '30', 'city', '"New York"']
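const input3 = 'path="C:\\Program Files\\App", version="1.0.0"';
console.log(customTokenizer(input3, [',', '=', ' ']));
// Output: ['path', '"C:Program FilesApp"', 'version', '"1.0.0"']
// Each backslash in the input is consumed as an escape character, so the path loses its separators.
// The space is included in the delimiters so that 'version' has no leading space.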
Handling Edge Cases
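- Consecutive Delimiters: Decide whether a run of delimiters such as ',,,' counts as one separator or as several separators with empty fields between them. The version above treats it as one.
- Empty Tokens: Preserve empty tokens when position matters, as in comma-separated records. The version above drops them.
- Complex Quotes: An unterminated quote turns the rest of the input into a single token, and a trailing backslash is silently discarded. Decide whether these should be reported as errors.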
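One way to surface malformed quoting is to scan the input before tokenizing it. The sketch below assumes the same rules as the tokenizer above (a backslash escapes the next character, the quote character toggles quoted mode); findQuoteProblem is a hypothetical helper name, and returning a message string is just one possible design.

```javascript
// Hypothetical helper: returns a description of the first problem found, or null.
function findQuoteProblem(str, quoteChar = '"') {
  let inQuotes = false;
  let openedAt = -1;

  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (char === '\\') {
      if (i === str.length - 1) {
        return `dangling backslash at index ${i}`;
      }
      i++; // skip the escaped character
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      if (inQuotes) openedAt = i; // remember where the open quote was
    }
  }
  return inQuotes ? `unterminated quote opened at index ${openedAt}` : null;
}

console.log(findQuoteProblem('name="John Doe"')); // null
console.log(findQuoteProblem('name="John Doe'));  // 'unterminated quote opened at index 5'
console.log(findQuoteProblem('path=C:\\'));       // 'dangling backslash at index 7'
```

A caller could run this check first and only tokenize input for which it returns null.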
Enhanced Implementation with Additional Features
The version below changes one condition: an empty token is now kept when it is produced by a non-space delimiter, so consecutive commas yield empty strings. Runs of spaces are still collapsed.
function customTokenizer(str, delimiters = [' ', ',', ';'], quoteChar = '"') {
  const tokens = [];
  let currentToken = '';
  let inQuotes = false;
  let escapeNext = false;

  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (escapeNext) {
      currentToken += char;
      escapeNext = false;
    } else if (char === '\\') {
      escapeNext = true;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      currentToken += char; // Include the quote character in the token
    } else if (!inQuotes && delimiters.includes(char)) {
      // Keep empty tokens produced by non-space delimiters; skip those produced by spaces
      if (currentToken || char !== ' ') {
        tokens.push(currentToken);
        currentToken = '';
      }
    } else {
      currentToken += char;
    }
  }

  // A trailing empty token is not emitted
  if (currentToken) {
    tokens.push(currentToken);
  }
  return tokens;
}
// Example usage with additional edge cases
const input4 = 'name="John Doe" ,,, age=30, city="New York"';
console.log(customTokenizer(input4, [',', '=', ' ']));
// Output: ['name', '"John Doe"', '', '', '', 'age', '30', 'city', '"New York"']
const input5 = 'path="C:\\\\Program Files\\\\App", version="1.0.0"';
console.log(customTokenizer(input5, [',', '=', ' ']));
// Output: ['path', '"C:\\Program Files\\App"', 'version', '"1.0.0"']
// Each pair of backslashes in the input collapses to a single backslash in the token.
Use Cases for Custom String Tokenizer
- Text Processing: Splitting text into words, sentences, or other meaningful elements.
- Parsing Configuration Files: Tokenizing configuration files with various delimiters and quoted strings.
- Command-Line Arguments: Parsing command-line arguments with complex quoting and escaping.
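As an illustration of the command-line use case, the sketch below splits a command string on spaces and strips the quote characters, so a quoted argument arrives as a single value. It is a simplified sketch and does not follow any particular shell's quoting rules; splitCommandLine and the sample command are hypothetical.

```javascript
// Hypothetical helper: splits on spaces outside quotes and removes the quotes.
function splitCommandLine(line, quoteChar = '"') {
  const args = [];
  let current = '';
  let inQuotes = false;
  let hasToken = false; // true once a token has started, even an empty "" one

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    if (char === '\\' && i + 1 < line.length) {
      current += line[i + 1]; // keep the escaped character, drop the backslash
      hasToken = true;
      i++;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes; // toggle quoted mode without copying the quote
      hasToken = true;
    } else if (!inQuotes && char === ' ') {
      if (hasToken) {
        args.push(current);
        current = '';
        hasToken = false;
      }
    } else {
      current += char;
      hasToken = true;
    }
  }
  if (hasToken) args.push(current);
  return args;
}

console.log(splitCommandLine('deploy --name "my app" --tag ""'));
// ['deploy', '--name', 'my app', '--tag', '']
```

Tracking hasToken separately from the token text is what lets an explicit empty argument ("") survive while runs of spaces are still collapsed.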