33

For pandas, I'm looking for a way to write conditional values to each row in column B, based on substrings for corresponding rows in column A.

So if cell in A contains "BULL", write "Long" to B. Or if cell in A contains "BEAR", write "Short" to B.

Desired output:

A                  B
"BULL APPLE X5"    "Long"
"BEAR APPLE X5"    "Short"
"BULL APPLE X5"    "Long"

B is initially empty: df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']],columns=['A','B'])

P A N
  • 5,642
  • 15
  • 52
  • 103
  • This is a very good question. The column B desired is Binary what if you have to assign B based on string search. – user3341078 Sep 22 '17 at 18:52

4 Answers4

37

Your code would error as you creating the Dataframe incorrectly, just create a single column A then add B based on A:

import pandas as pd
df = pd.DataFrame(["BULL","BEAR","BULL"], columns=['A'])
df["B"] = ["Long" if ele  == "BULL" else "Short" for ele in df["A"]]

print(df)

    A      B
0  BULL   Long
1  BEAR  Short
2  BULL   Long

Or do you logic with the data before you create the dataframe:

import pandas as pd
data = ["BULL","BEAR","BULL"]
data2 = ["Long" if ele  == "BULL" else "Short" for ele in data]
df = pd.DataFrame(list(zip(data, data2)), columns=['A','B'])

print(df)
      A      B
 0  BULL   Long
 1  BEAR  Short
 2  BULL   Long

For your edit:

df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']], columns=['A','B'])

df["B"] = df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")

print(df)

            A      B
0  BULL APPLE X5   Long
1  BEAR APPLE X5  Short
2  BULL APPLE X5   Long

Or just add the column after:

df = pd.DataFrame(['BULL APPLE X5','BEAR APPLE X5','BLL APPLE X5'], columns=['A'])

df["B"] = df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")

print(df)

Or using contains:

df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']], columns=['A','B'])


df["B"][df['A'].str.contains("BULL")] = "Long"
df["B"][df['A'].str.contains("BEAR")] = "Short"

print(df)
0  BULL APPLE X5   Long
1  BEAR APPLE X5  Short
2  BULL APPLE X5   Long
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • this only works when not being at Bull you must be a Bear. But this is not exactly the condition of the OP – joaquin Jun 20 '15 at 11:32
  • @joaquin, the OP does not say there is any other possibility but if there were it would not be too hard to add the logic for it. – Padraic Cunningham Jun 20 '15 at 11:36
  • @PadraicCunningham Thank you, I have corrected the OP with the correct DataFrame as to not confuse future readers. You may edit your answer if you like. Will try your solution – P A N Jun 20 '15 at 11:37
  • @Winterflags, are there other possibilities beside BULL and BEAR? – Padraic Cunningham Jun 20 '15 at 11:41
  • 1
    Absolutely, pretty easy to correct. Just I wanted to note that "if cell in A contains "BEAR"" is not the same as " if cell in A doesnt contain "Bull"". – joaquin Jun 20 '15 at 11:42
  • @PadraicCunningham BULL and BEAR will be the selectors for this particular function, but technically they are embedded in other string content so I will have to use `str.contains` or similar. – P A N Jun 20 '15 at 11:43
  • 1
    @Winterflags so you are matching substrings? – Padraic Cunningham Jun 20 '15 at 11:44
  • Yes, the full value for cell in A may be `BULL APPLE X5` for instance. I should've made more clear that I intended for `contains` although it shouldn't change the fundamentals too much. Sorry about that. – P A N Jun 20 '15 at 11:47
  • @PadraicCunningham Thanks a lot! Both solutions work beautifully. Is `str.contains` faster than `.map(lambda x:`? – P A N Jun 20 '15 at 12:07
  • @Winterflags, I imagine unutbu's might be the fastest but if you can share a link to the data I will run a few tests and see how each performs – Padraic Cunningham Jun 20 '15 at 12:21
  • @PadraicCunningham That would probably be too much work to ask of you :) Either one works fine, but `.map(lambda x: ` coincidentally did run a bit jerkily and a .Ds_Store file was created in the folder where I read/write csv to. Not sure what happened there but probably unrelated to the code! Thanks for all your help. – P A N Jun 20 '15 at 12:26
  • @Winterflags, no prob, I don't mind adding the timings but it is up to yourself. – Padraic Cunningham Jun 20 '15 at 12:27
10

Also, for populating the df['B'] you can try the below method -

def applyFunc(s):
    if s == 'BULL':
        return 'Long'
    elif s == 'BEAR':
        return 'Short'
    return ''

df['B'] = df['A'].apply(applyFunc)
df
>>
       A      B
0  BULL   Long
1  BEAR  Short
2  BULL   Long

What the apply function does, is that for each row value of df['A'] , it calls the applyFunc function with the parameter as the value of that row , and the returned value is put into the same row for df['B'] , what really happens behind the scene is a bit different though, the value is not directly put into df['B'] but rather a new Series is created and at the end, the new Series is assigned to df['B'] .

Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • Thank you, I have corrected the OP with the correct DataFrame as to not confuse future readers. You may edit your answer if you like. – P A N Jun 20 '15 at 11:37
7

You could use str.extract to search for regex pattern BULL|BEAR, and then use Series.map to replace those strings with Long or Short:

In [50]: df = pd.DataFrame([['BULL APPLE X5',''],['BEAR APPLE X5',''],['BULL APPLE X5','']],columns=['A','B'])

In [51]: df['B'] = df['A'].str.extract(r'(BULL|BEAR)').map({'BULL':'Long', 'BEAR':'Short'})

In [55]: df
Out[55]: 
               A      B
0  BULL APPLE X5   Long
1  BEAR APPLE X5  Short
2  BULL APPLE X5   Long

However, forming the intermediate Series with str.extract is quite slow compared to df['A'].map(lambda x:...). Using IPython's %timeit to time the benchmarks,

In [5]: df = pd.concat([df]*10000)

In [6]: %timeit df['A'].str.extract(r'(BULL|BEAR)').map({'BULL':'Long', 'BEAR':'Short'})
10 loops, best of 3: 39.7 ms per loop

In [7]: %timeit df["A"].map(lambda x: "Long" if "BULL" in x else "Short" if "BEAR" in x else "")
100 loops, best of 3: 4.98 ms per loop

The majority of time is spent in str.extract:

In [8]: %timeit df['A'].str.extract(r'(BULL|BEAR)')
10 loops, best of 3: 37.1 ms per loop

while the call to Series.map is relatively fast:

In [9]: x = df['A'].str.extract(r'(BULL|BEAR)')

In [10]: %timeit x.map({'BULL':'Long', 'BEAR':'Short'})
1000 loops, best of 3: 1.82 ms per loop
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
0

Alternatively you can use np.select if you don't mind using NumPy:

import numpy as np

df['B'] = np.select(
    [df['A'].str.contains('BULL'), df['A'].str.contains('BEAR')],
    ['Long', 'Short'],
    default=np.nan,
)

If neither BULL or BEAR is found, NaN is returned.

rachwa
  • 1,805
  • 1
  • 14
  • 17