Coder Perfect

What is the best way to get a string after a substring?

Problem

What is the best way to get a string after a substring?

For instance, I’d like to obtain the string that comes after “world” in

my_string="hello python world, I'm a beginner"

…which in this case is “, I’m a beginner”.

Asked by havox

Solution #1

The simplest method is to just split on your target word.

my_string = "hello python world , i'm a beginner "
print(my_string.split("world", 1)[1])

split takes a word (or character) to split on, as well as a limit on the number of splits that can be done.

In this case, divide on “world” and limit the number of splits to one.
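To make the effect of the limit concrete, here is a small sketch in Python 3 syntax. The variable names parts, after, missing and after_missing are illustrative and not from the original answer. It also shows why indexing with [1] needs a guard: when the target word is absent, split returns a single-element list.

```python
my_string = "hello python world , i'm a beginner "

# With a limit of 1 the string is cut only at the first "world",
# so the result has at most two pieces: before and after.
parts = my_string.split("world", 1)
after = parts[1]

# If the target is missing, split returns a one-element list and
# missing[1] would raise IndexError, so check the length first.
missing = my_string.split("Monty", 1)
after_missing = missing[1] if len(missing) == 2 else None

print(repr(after))
print(after_missing)
```

Whether to return None, an empty string, or the original string when the word is missing is a design choice; None is used here only for illustration.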

Answered by Joran Beasley

Solution #2

s1 = "hello python world , i'm a beginner "
s2 = "world"

print(s1[s1.index(s2) + len(s2):])

If you need to handle the case where s2 isn’t present in s1, use s1.find(s2) instead of index. If that call returns -1, then s2 isn’t in s1.
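The find-based variant could look roughly like this. The function name text_after and the choice to return None for a missing substring are assumptions made for this sketch, not part of the original answer.

```python
def text_after(s1, s2):
    """Return the part of s1 after the first s2, or None if s2 is absent."""
    pos = s1.find(s2)  # find returns -1 where index would raise ValueError
    if pos == -1:
        return None
    return s1[pos + len(s2):]


print(repr(text_after("hello python world , i'm a beginner ", "world")))
print(text_after("hello python", "world"))
```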

Answered by arshajii

Solution #3

I’m amazed no one brought up partition.

def substring_after(s, delim):
    return s.partition(delim)[2]

This solution, in my opinion, is more readable than @arshajii’s. Aside from that, I believe @arshajii’s is the best in terms of speed, because it does not create any unnecessary copies or substrings.
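For readers unfamiliar with partition, this short sketch shows the three-part tuple it returns, which is why index [2] yields the text after the delimiter. The names found and not_found are illustrative only.

```python
def substring_after(s, delim):
    # partition returns (before, delimiter, after); [2] is the "after" part
    return s.partition(delim)[2]


# Delimiter present: the string is cut at the first occurrence.
found = "hello python world , i'm a beginner ".partition("world")

# Delimiter absent: the whole string comes back first, then two empty strings.
not_found = "hello python".partition("world")

print(found)
print(not_found)
```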

Answered by shx2

Solution #4

str.partition() is what you want to use:

>>> my_string.partition("world")[2]
" , i'm a beginner "

This option is faster than the alternatives.

If the delimiter is not present, the result is an empty string:

>>> my_string.partition("Monty")[2]  # delimiter missing
''

If you want the original string back in that case, test whether the second value returned by str.partition() is non-empty:

prefix, success, result = my_string.partition(delimiter)
if not success: result = prefix

Alternatively, you might use str.split() with a limit of 1:

>>> my_string.split("world", 1)[-1]
" , i'm a beginner "
>>> my_string.split("Monty", 1)[-1]  # delimiter missing
"hello python world , i'm a beginner "

However, this option is slower. For a best-case scenario, str.partition() is easily about 15% faster compared to str.split():

                                missing        first         lower         upper          last
      str.partition(...)[2]:  [3.745 usec]  [0.434 usec]  [1.533 usec]  <3.543 usec>  [4.075 usec]
str.partition(...) and test:   3.793 usec    0.445 usec    1.597 usec    3.208 usec    4.170 usec
      str.split(..., 1)[-1]:  <3.817 usec>  <0.518 usec>  <1.632 usec>  [3.191 usec]  <4.173 usec>
            % best vs worst:         1.9%         16.2%          6.1%          9.9%          2.3%

This table shows the best timing for each scenario: the delimiter is missing (worst case), placed first (best case), or located in the lower half, the upper half, or the last position. […] marks the fastest time and <…> marks the slowest.

The table above was produced by a comprehensive time trial of all three options, shown below. I ran the tests with Python 3.7.4 on a 2017 model 15″ MacBook Pro with a 2.9 GHz Intel Core i7 and 16 GB of RAM.

This script generates random sentences with and without the randomly selected delimiter present, and if present, at different positions in the generated sentence, then runs the tests in random order with repeats (providing the most accurate results while accounting for random OS events that occur during testing), and then prints a table of the results:

import random
from itertools import product
from operator import itemgetter
from pathlib import Path
from timeit import Timer

setup = "from __main__ import sentence as s, delimiter as d"
tests = {
    "str.partition(...)[2]": "r = s.partition(d)[2]",
    "str.partition(...) and test": (
        "prefix, success, result = s.partition(d)\n"
        "if not success: result = prefix"
    ),
    "str.split(..., 1)[-1]": "r = s.split(d, 1)[-1]",
}

placement = "missing first lower upper last".split()
delimiter_count = 3

wordfile = Path("/usr/dict/words")  # Linux
if not wordfile.exists():
    # macos
    wordfile = Path("/usr/share/dict/words")
words = [w.strip() for w in wordfile.open()]

def gen_sentence(delimiter, where="missing", l=1000):
    """Generate a random sentence of length l

    The delimiter is incorporated according to the value of where:

    "missing": no delimiter
    "first":   delimiter is the first word
    "lower":   delimiter is present in the first half
    "upper":   delimiter is present in the second half
    "last":    delimiter is the last word

    """
    possible = [w for w in words if delimiter not in w]
    sentence = random.choices(possible, k=l)
    half = l // 2
    if where == "first":
        # best case, at the start
        sentence[0] = delimiter
    elif where == "lower":
        # lower half
        sentence[random.randrange(1, half)] = delimiter
    elif where == "upper":
        sentence[random.randrange(half, l)] = delimiter
    elif where == "last":
        sentence[-1] = delimiter
    # else: worst case, no delimiter

    return " ".join(sentence)

delimiters = random.choices(words, k=delimiter_count)
timings = {}
sentences = [
    # where, delimiter, sentence
    (w, d, gen_sentence(d, w)) for d, w in product(delimiters, placement)
]
test_mix = [
    # label, test, where, delimiter, sentence
    (*t, *s) for t, s in product(tests.items(), sentences)
]
random.shuffle(test_mix)

for i, (label, test, where, delimiter, sentence) in enumerate(test_mix, 1):
    print(f"\rRunning timed tests, {i:2d}/{len(test_mix)}", end="")
    t = Timer(test, setup)
    number, _ = t.autorange()
    results = t.repeat(5, number)
    # best time for this specific random sentence and placement
    timings.setdefault(
        label, {}
    ).setdefault(
        where, []
    ).append(min(dt / number for dt in results))

print()

scales = [(1.0, 'sec'), (0.001, 'msec'), (1e-06, 'usec'), (1e-09, 'nsec')]
width = max(map(len, timings))
rows = []
bestrow = dict.fromkeys(placement, (float("inf"), None))
worstrow = dict.fromkeys(placement, (float("-inf"), None))

for row, label in enumerate(tests):
    columns = []
    worst = float("-inf")
    for p in placement:
        timing = min(timings[label][p])
        if timing < bestrow[p][0]:
            bestrow[p] = (timing, row)
        if timing > worstrow[p][0]:
            worstrow[p] = (timing, row)
        worst = max(timing, worst)
        columns.append(timing)

    scale, unit = next((s, u) for s, u in scales if worst >= s)
    rows.append(
        [f"{label:>{width}}:", *(f" {c / scale:.3f} {unit} " for c in columns)]
    )

colwidth = max(len(c) for r in rows for c in r[1:])
print(' ' * (width + 1), *(p.center(colwidth) for p in placement), sep="  ")
for r, row in enumerate(rows):
    for c, p in enumerate(placement, 1):
        if bestrow[p][1] == r:
            row[c] = f"[{row[c][1:-1]}]"
        elif worstrow[p][1] == r:
            row[c] = f"<{row[c][1:-1]}>"
    print(*row, sep="  ")

percentages = []
for p in placement:
    best, worst = bestrow[p][0], worstrow[p][0]
    ratio = ((worst - best) / worst)
    percentages.append(f"{ratio:{colwidth - 1}.1%} ")

print("% best vs worst:".rjust(width + 1), *percentages, sep="  ")

Answered by Martijn Pieters

Solution #5

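If you want to do this with a regex, you can use a non-capturing group to match the word “world” and then grab everything after it, like so:

(?:world).*

Applied to the example string, this matches “world” and everything that follows it on the line. The match itself still begins with “world”; the non-capturing group only keeps it from being stored as a separate group.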
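As a rough sketch of how this might be used from Python’s re module, the snippet below runs the pattern as written and then a capturing-group variant that returns only the text after the word. The variant and the variable names are suggestions for illustration, not part of the original answer.

```python
import re

my_string = "hello python world , i'm a beginner "

# The pattern as given: the match includes "world" itself.
whole = re.search(r"(?:world).*", my_string)

# Variant: capture only what follows the literal word.
m = re.search(r"world(.*)", my_string)
after = m.group(1) if m else None  # None when "world" is absent

print(repr(whole.group(0)))
print(repr(after))
```

If the word being searched for may contain regex metacharacters, passing it through re.escape before building the pattern avoids surprises. Note also that . does not match newlines unless re.DOTALL is set.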

Answered by Tadgh

Post is based on https://stackoverflow.com/questions/12572362/how-to-get-a-string-after-a-specific-substring