Implementing a Custom String Tokenizer in JavaScript



A custom string tokenizer is a function that splits a string into tokens according to rules you define. Tokenizers are often used in compilers, interpreters, and text-processing tools. In this episode, we'll build one that handles configurable delimiters, quoted strings, and backslash escapes.


What is a String Tokenizer?

A string tokenizer is a tool that splits a string into a sequence of tokens. Tokens are meaningful elements like words, numbers, or symbols, separated by delimiters such as spaces, commas, or custom characters.
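
For simple input, the built-in String.prototype.split with a regular expression already behaves like a basic tokenizer. The sketch below shows that approach and where it falls short: it has no notion of quotes, so a quoted phrase is split at its inner space. The name naiveTokenize and the sample strings are illustrative only.

```javascript
// A naive tokenizer: split on runs of spaces, commas, or semicolons,
// then drop the empty strings that leading/trailing delimiters produce.
const naiveTokenize = (text) => text.split(/[ ,;]+/).filter(Boolean);

const simple = naiveTokenize('apple, banana;cherry  date');
console.log(simple);
// ['apple', 'banana', 'cherry', 'date']

const quoted = naiveTokenize('say "hello world"');
console.log(quoted);
// ['say', '"hello', 'world"']  (the quoted phrase is broken in two)
```

This limitation is the main reason to write a character-by-character tokenizer instead.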

Real Interview Insights

Interviewers might ask you to:

  • Implement a custom string tokenizer.
  • Handle various delimiters, including spaces, commas, and custom characters.
  • Handle quoted strings where delimiters within quotes are ignored.
  • Handle escape sequences within quoted strings.

Implementing a Custom String Tokenizer

Here’s an implementation of a custom string tokenizer:

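function customTokenizer(str, delimiters = [' ', ',', ';'], quoteChar = '"') {
  const tokens = [];
  let currentToken = '';
  let inQuotes = false;   // true while scanning between a pair of quote characters
  let escapeNext = false; // true when the previous character was a backslash

  for (let i = 0; i < str.length; i++) {
    const char = str[i];

    if (escapeNext) {
      // Take the escaped character literally, whatever it is
      currentToken += char;
      escapeNext = false;
    } else if (char === '\\') {
      // Drop the backslash itself and escape the next character
      escapeNext = true;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      currentToken += char; // Include the quote character in the token
    } else if (!inQuotes && delimiters.includes(char)) {
      // A delimiter outside quotes ends the current token; empty tokens are skipped
      if (currentToken) {
        tokens.push(currentToken);
        currentToken = '';
      }
    } else {
      currentToken += char;
    }
  }

  // Flush the final token (a trailing lone backslash is discarded)
  if (currentToken) {
    tokens.push(currentToken);
  }

  return tokens;
}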
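
Explanation:
  • Delimiter Handling: Any character in delimiters that appears outside quotes ends the current token.
  • Quote Handling: quoteChar toggles a quoted state in which delimiters are treated as ordinary characters. The quote characters themselves are kept in the token.
  • Escape Sequences: A backslash is dropped and the character after it is added literally, both inside and outside quotes. This lets a quoted string contain its own quote character.
  • Edge Cases: Empty tokens are never emitted, so consecutive delimiters collapse into a single separator.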

Practical Examples

Consider examples with various delimiters and quoted strings:

const input1 = 'Hello, "world", how are you?';
console.log(customTokenizer(input1, [',', ' ']));
// Output: ['Hello', '"world"', 'how', 'are', 'you?']
 
const input2 = 'name="John Doe", age=30, city="New York"';
console.log(customTokenizer(input2, [',', '=', ' ']));
// Output: ['name', '"John Doe"', 'age', '30', 'city', '"New York"']
 
const input3 = 'path="C:\\Program Files\\App", version="1.0.0"';
console.log(customTokenizer(input3, [',', '=']));
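// Output: ['path', '"C:Program FilesApp"', ' version', '"1.0.0"']
// Each backslash is consumed as an escape character, and ' version' keeps its
// leading space because ' ' is not in the delimiter list for this call.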
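
Because customTokenizer keeps the quote characters in each token, callers who want the bare value need a post-processing step. A minimal sketch of one such step follows. stripQuotes is a hypothetical helper, not part of the tokenizer above, and it assumes a single-character quote and that a token counts as quoted only when it both starts and ends with that character.

```javascript
// Hypothetical helper: remove one pair of surrounding quote characters.
// Assumes quoteChar is a single character.
function stripQuotes(token, quoteChar = '"') {
  const isQuoted =
    token.length >= 2 &&
    token.startsWith(quoteChar) &&
    token.endsWith(quoteChar);
  return isQuoted ? token.slice(1, -1) : token;
}

// Tokens shaped like the ones customTokenizer returns for input2.
const rawTokens = ['name', '"John Doe"', 'age', '30'];

// Wrap the call in an arrow function so that map's index argument
// is not mistaken for quoteChar.
const cleanTokens = rawTokens.map((token) => stripQuotes(token));
console.log(cleanTokens);
// ['name', 'John Doe', 'age', '30']
```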

Handling Edge Cases

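  1. Consecutive Delimiters: Decide whether a run such as ,,, collapses into a single separator or yields empty tokens between the delimiters.
  2. Empty Tokens: The implementation above drops them. Formats where position matters, such as CSV, need them preserved.
  3. Complex Quotes: An escaped quote character inside a quoted string must not end the quoted string.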
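
A related case is a quote that is never closed: the tokenizer would treat the rest of the input as one quoted token without complaint. The sketch below is one possible pre-check that uses the same backslash-escape convention. hasUnterminatedQuote is a hypothetical name introduced here for illustration.

```javascript
// Hypothetical pre-check: returns true if the string ends while still
// inside a quoted section. Escaped quote characters are ignored.
function hasUnterminatedQuote(str, quoteChar = '"') {
  let inQuotes = false;
  let escapeNext = false;

  for (const char of str) {
    if (escapeNext) {
      escapeNext = false; // the escaped character cannot open or close a quote
    } else if (char === '\\') {
      escapeNext = true;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
    }
  }

  return inQuotes;
}

const openQuote = hasUnterminatedQuote('name="John Doe');
const balanced = hasUnterminatedQuote('name="John \\"JD\\" Doe"');
console.log(openQuote); // true
console.log(balanced);  // false
```

A caller could run this check first and report an error instead of tokenizing malformed input.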

Enhanced Implementation with Additional Features

This version differs from the first in one line: when a delimiter other than a space arrives while the current token is empty, an empty token is emitted. Runs of spaces are still collapsed.

function customTokenizer(str, delimiters = [' ', ',', ';'], quoteChar = '"') {
  const tokens = [];
  let currentToken = '';
  let inQuotes = false;
  let escapeNext = false;
 
  for (let i = 0; i < str.length; i++) {
    const char = str[i];
 
    if (escapeNext) {
      currentToken += char;
      escapeNext = false;
    } else if (char === '\\') {
      escapeNext = true;
    } else if (char === quoteChar) {
      inQuotes = !inQuotes;
      currentToken += char; // Include the quote character in the token
    } else if (!inQuotes && delimiters.includes(char)) {
      if (currentToken || char !== ' ') { // Emit empty tokens for non-space delimiters; collapse runs of spaces
        tokens.push(currentToken);
        currentToken = '';
      }
    } else {
      currentToken += char;
    }
  }
 
  if (currentToken) {
    tokens.push(currentToken);
  }
 
  return tokens;
}
 
// Example usage with additional edge cases
const input4 = 'name="John Doe"  ,,, age=30, city="New York"';
console.log(customTokenizer(input4, [',', '=', ' ']));
// Output: ['name', '"John Doe"', '', '', '', 'age', '30', 'city', '"New York"']
 
const input5 = 'path="C:\\\\Program Files\\\\App", version="1.0.0"';
console.log(customTokenizer(input5, [',', '=']));
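// Output: ['path', '"C:\\Program Files\\App"', ' version', '"1.0.0"']
// The input contains doubled backslashes; each pair is reduced to one literal
// backslash (shown here in string-literal form as \\). ' version' keeps its
// leading space because ' ' is not in the delimiter list for this call.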

Use Cases for Custom String Tokenizer

  1. Text Processing: Splitting text into words, sentences, or other meaningful elements.
  2. Parsing Configuration Files: Tokenizing configuration files with various delimiters and quoted strings.
  3. Command-Line Arguments: Parsing command-line arguments with complex quoting and escaping.
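
As an illustration of the command-line use case, here is a sketch of an argument splitter built on the same state-machine idea. It is a simplified convention, not the quoting rules of any real shell: splitArgs is a hypothetical function, it recognizes only double quotes, and unlike customTokenizer it drops the quote characters and preserves empty quoted arguments.

```javascript
// Hypothetical command-line splitter: whitespace separates arguments,
// double quotes group words, and a backslash escapes the next character.
function splitArgs(commandLine) {
  const args = [];
  let current = '';
  let hasToken = false; // lets an empty quoted argument ("") survive
  let inQuotes = false;
  let escapeNext = false;

  for (const char of commandLine) {
    if (escapeNext) {
      current += char;
      hasToken = true;
      escapeNext = false;
    } else if (char === '\\') {
      escapeNext = true;
    } else if (char === '"') {
      inQuotes = !inQuotes;
      hasToken = true; // the quote is dropped, but it does start a token
    } else if (!inQuotes && /\s/.test(char)) {
      if (hasToken) {
        args.push(current);
        current = '';
        hasToken = false;
      }
    } else {
      current += char;
      hasToken = true;
    }
  }

  if (hasToken) {
    args.push(current);
  }
  return args;
}

const parsedArgs = splitArgs('cp "my file.txt" /tmp/backup --label ""');
console.log(parsedArgs);
// ['cp', 'my file.txt', '/tmp/backup', '--label', '']
```

Tracking hasToken separately from the token text is what allows an empty quoted argument to be kept while runs of whitespace are still collapsed.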