
Regular expression to detect company tickers using Java

Tags:

regex

I am trying to pick out company tickers from a list of candidate strings.

The following code is what I have so far. I need to make the regex strict enough that only certain patterns pass. See the example code below for the specifics.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Current pattern: too permissive, it accepts any mix of A-Z, digits, ':' and '.'
Pattern tickerPattern = Pattern.compile("^[A-Z:\\.0-9]+$");

String[] tickerStrArr = {
    "JELK90#$",  // NOT A TICKER
    "1",         // NOT A TICKER
    "0",         // NOT A TICKER
    "R",         // NOT A TICKER
    "25.36",     // NOT A TICKER
    "1.0",       // NOT A TICKER
    "GOOG",      // Ticker
    "NYSE:C",    // Ticker (with exchange code NYSE)
    "GOOG.BY",   // Ticker (with exchange code BY)
    "$90",       // NOT A TICKER
    "98774",     // Ticker (all digits, but more than 4 digits long)
    "789.BY"     // Ticker (because it ends with .[A-Z]{2})
};

for (String tickerStr : tickerStrArr)
{
    Matcher matcher = tickerPattern.matcher(tickerStr);

    // The pattern is anchored with ^ and $, so find() only succeeds on a whole-string match
    if (matcher.find())
    {
        System.out.println("It's a ticker=>" + matcher.group());
    }
}

Expected output

It's a ticker=>GOOG
It's a ticker=>NYSE:C
It's a ticker=>GOOG.BY
It's a ticker=>98774
It's a ticker=>789.BY

Can you formulate a regex that produces the expected output above?

Here are the rules for my own filtering (not necessarily applicable to everyone):

  1. Only letters or digits can be part of a ticker; no special characters or currency symbols.

  2. Tickers are sometimes given with an exchange code as a prefix, for example NYSE:C, or with a two-character exchange code as a suffix, for example C.BY.

  3. If it is all digits, it must be more than 4 digits long (this rules out millions of false positives).

  4. If the digits come with an exchange code, the ticker can be less than 4 digits long, because then we have high confidence.

Let me know if you need more details.

Watt asked Jun 06 '26 13:06


1 Answer

The regex below matches the following, in order:

  • Start of the string
  • PreXChangeCode: optionally, 2 to 4 letters followed by a colon, unless the letters or digits after the colon are followed by a period. This rejects an invalid symbol that carries more than one exchange code
  • Stock: 1 to 4 letters, OR 1 to 3 digits followed by a period, OR 4 or more digits
  • PostXChangeCode: optionally, a period followed by exactly 2 letters
  • End of the string

Regex

 ^
 (?<PreXChangeCode>[a-z]{2,4}:(?![a-z\d]+\.))?
 (?<Stock>[a-z]{1,4}|\d{1,3}(?=\.)|\d{4,})
 (?<PostXChangeCode>\.[a-z]{2})?
 $
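
To use this pattern from Java, two details matter: the character classes are lowercase, so it needs case-insensitive matching to accept uppercase tickers, and it is spread over several lines, so it needs free-spacing mode. Here is a minimal sketch, assuming the flags Pattern.CASE_INSENSITIVE and Pattern.COMMENTS; the class name TickerFilter and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative only.
class TickerFilter {

    // COMMENTS lets the pattern contain whitespace and line breaks;
    // CASE_INSENSITIVE lets the lowercase classes [a-z] accept "GOOG".
    static final Pattern TICKER = Pattern.compile(
            "^\n"
          + " (?<PreXChangeCode>[a-z]{2,4}:(?![a-z\\d]+\\.))?\n"
          + " (?<Stock>[a-z]{1,4}|\\d{1,3}(?=\\.)|\\d{4,})\n"
          + " (?<PostXChangeCode>\\.[a-z]{2})?\n"
          + "$",
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    // True when the whole candidate string matches the ticker pattern.
    static boolean isTicker(String candidate) {
        return TICKER.matcher(candidate).matches();
    }

    // Keeps only the candidates that look like tickers, in input order.
    static List<String> filter(String[] candidates) {
        List<String> tickers = new ArrayList<>();
        for (String candidate : candidates) {
            if (isTicker(candidate)) {
                tickers.add(candidate);
            }
        }
        return tickers;
    }

    public static void main(String[] args) {
        String[] tickerStrArr = {
            "JELK90#$", "1", "0", "R", "25.36", "1.0",
            "GOOG", "NYSE:C", "GOOG.BY", "$90", "98774", "789.BY"
        };
        for (String ticker : filter(tickerStrArr)) {
            System.out.println("It's a ticker=>" + ticker);
        }
    }
}
```

Run against the test array from the question, this prints R, GOOG, NYSE:C, GOOG.BY, 98774 and 789.BY. Note that \d{4,} also accepts exactly four digits; if "more than 4 digits" is meant strictly, \d{5,} would be the stricter choice.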

REY

I tested it with REY and it correctly matches your test data, with the exception of R. I included one-character stock names because they are valid (R is Ryder System).
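
Because the pattern uses named groups, the exchange code and the stock symbol can be read separately after a match. The following is a rough sketch, assuming the pattern is written on one line with the case-insensitive flag; the class name TickerParts and the parse method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are illustrative only.
class TickerParts {

    // Same pattern as in this answer, written on one line so that
    // only CASE_INSENSITIVE is needed.
    static final Pattern TICKER = Pattern.compile(
            "^(?<PreXChangeCode>[a-z]{2,4}:(?![a-z\\d]+\\.))?"
          + "(?<Stock>[a-z]{1,4}|\\d{1,3}(?=\\.)|\\d{4,})"
          + "(?<PostXChangeCode>\\.[a-z]{2})?$",
            Pattern.CASE_INSENSITIVE);

    // Returns {exchange, stock}; exchange is null when no code is present.
    // Returns null when the candidate is not a ticker.
    static String[] parse(String candidate) {
        Matcher m = TICKER.matcher(candidate);
        if (!m.matches()) {
            return null;
        }
        String pre = m.group("PreXChangeCode");   // e.g. "NYSE:" or null
        String post = m.group("PostXChangeCode"); // e.g. ".BY" or null
        String exchange = null;
        if (pre != null) {
            exchange = pre.substring(0, pre.length() - 1); // drop trailing ':'
        } else if (post != null) {
            exchange = post.substring(1);                  // drop leading '.'
        }
        return new String[] { exchange, m.group("Stock") };
    }

    public static void main(String[] args) {
        for (String s : new String[] {"NYSE:C", "GOOG.BY", "98774", "25.36"}) {
            String[] parts = parse(s);
            System.out.println(s + " -> " + (parts == null
                    ? "not a ticker"
                    : "exchange=" + parts[0] + ", stock=" + parts[1]));
        }
    }
}
```

A group that did not take part in the match comes back as null from Matcher.group(String), which is why the sketch checks both exchange groups before trimming the separator.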

Daniel Gimenez answered Jun 08 '26 14:06