
I understand the question title is vague. My apologies. I have a hashmap whose keys are strings and whose values are lists of lists. For a given key, each of the sublists in the value has a corresponding probability of being chosen. For example, one item in the hashmap might look like

"NP":[['A', 'B'], ['C', 'D', 'E'], ['F']]

I need to choose one of the lists on the right-hand side randomly. Each list has its own probability. Here are the input strings that would generate the above item in the map.

3 NP A B
1 NP C D E
1 NP F

Because the line NP A B has the number 3 next to it, NP C D E has 1 next to it, and NP F has 1 next to it, the probability ratio is 3:1:1, so ['A', 'B'] has a 3/5 probability of being chosen, ['C', 'D', 'E'] has 1/5, and the same for ['F'].

My question is, how do I simulate those probabilities?

Before those numbers were introduced it was easy because I could count the length of the list (in the above example it would be 3), choose a random number between 0 and len(list) - 1 inclusive with random.randint(), and then choose that index from the list. To simulate a Bernoulli random variable, I know that one can check if random.random() < p. But that only works if you have 2 cases. I can't explicitly write if statements to check because a list might have n elements.

Jeremy Fisher
    This is trivial using [numpy.random.choice](http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html). If you don't want to use numpy, then you have to implement your own **discrete-sampling algorithm**. Look at wiki for some ideas. A good one, but harder to implement, is the **alias-algorithm**, but you could also do it using binary-search (or even linear-search). – sascha Sep 19 '16 at 21:38
  • Yeah, unfortunately I can't use numpy or libraries that the user needs to install – Jeremy Fisher Sep 19 '16 at 21:40
  • See [Weighted random generation in Python](http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python) by Eli Bendersky. – Steven Rumbalski Sep 19 '16 at 21:46
  • Possible duplicate of [Select k random elements from a list whose elements have weights](http://stackoverflow.com/questions/2140787/select-k-random-elements-from-a-list-whose-elements-have-weights) – Barmar Sep 19 '16 at 22:05
  • @StevenRumbalski so in that article, would the `weights` list I pass in be `[3, 1, 1]` in my example, or `[60, 20, 20]` where each element is calculated from `[(3/5)*100, (1/5)*100, (1/5)*100]`? – Jeremy Fisher Sep 20 '16 at 02:24
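On the follow-up about weights: a weighted sampler generally does not require normalized weights, so `[3, 1, 1]` and `[60, 20, 20]` behave identically — only the ratios matter. In newer Python (3.6+), the standard library's `random.choices` accepts such weights directly, as a minimal sketch:

```python
import random

NP = [['A', 'B'], ['C', 'D', 'E'], ['F']]

# weights need not sum to 1 (or 100); only their ratios matter,
# so [3, 1, 1] and [60, 20, 20] give the same distribution
picked = random.choices(NP, weights=[3, 1, 1], k=1)[0]
print(picked)  # one of the three sublists, chosen with ratio 3:1:1
```

Note that `random.choices` was added in Python 3.6, so it would not have been available at the time of this question on older installs.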

3 Answers


So the way I would solve this is by creating a sparse table ranging from 0 to total probability. In your case, that's

0 -> 0
3 -> 1
4 -> 2

Then pick an int between 0 and 4, and take the entry with the largest key that is <= the chosen value (in other words, 1 maps to 0, 2 maps to 0, 3 maps to 1). The 'value' in this mapping corresponds to the sublist in your original dictionary. That shouldn't take any additional libraries.
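The table-plus-lookup idea above can be sketched with the standard library's `bisect` module (the helper name `weighted_pick` is illustrative, not from the question):

```python
import bisect
import random

def weighted_pick(sublists, weights):
    # Build cumulative boundaries: weights (3, 1, 1) -> [3, 4, 5].
    boundaries = []
    total = 0
    for w in weights:
        total += w
        boundaries.append(total)
    # Draw an int in [0, total - 1]; r in 0..2 falls in the first
    # bucket, r == 3 in the second, r == 4 in the third.
    r = random.randint(0, total - 1)
    return sublists[bisect.bisect_right(boundaries, r)]

NP = [['A', 'B'], ['C', 'D', 'E'], ['F']]
print(weighted_pick(NP, (3, 1, 1)))  # one of the sublists, weighted 3:1:1
```

Using `bisect_right` makes the lookup O(log n) per sample instead of a linear scan, which matches the binary-search suggestion in the comments.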

Eric

Here is a crude approach that might suffice if your total weight remains small:

>>> NP = [['A', 'B'], ['C', 'D', 'E'], ['F']]
>>> weights = (3,1,1)
>>> indx_list = [idx for idx,w in zip(range(len(NP)), weights) for _ in range(w)]
>>> indx_list
[0, 0, 0, 1, 2]
>>> import random
>>> random.choice([0, 0, 0, 1, 2])
1
>>> sample = [random.choice(indx_list) for _ in range(1000)]
>>> from collections import Counter
>>> counts = Counter(sample)
>>> counts
Counter({0: 600, 2: 213, 1: 187})
juanpa.arrivillaga

Here is a simple prototype which uses linear-search. The only dependency is random.random() to obtain a float within [0,1).

Despite the unoptimized approach, it needs only ~0.25 seconds for 100,000 samples on my PC. But keep in mind that this performance depends on the probability vector. The code could also be improved by presorting (putting high-weight items first so the linear search terminates earlier on average).

For the general idea: check this.

Code

import random

""" Discrete-sampling """
def cum_sum(xs):
    """Return the running cumulative sums of xs and their total."""
    sums = []
    total = 0
    for x in xs:
        total += x
        sums.append(total)
    return sums, sums[-1]

def discrete_sample(items, probs):
    cum_sum_, max_ = cum_sum(probs)
    random_val = random.random() * max_
    for ind, i in enumerate(items):
        if random_val < cum_sum_[ind]:
            return i
    return items[-1]  # fail-safe

def sample_from_dict(element, data, data_p):
    data_ = data[element]
    data_p_ = data_p[element]
    selection = discrete_sample(range(len(data_)), data_p_)
    return data_[selection]

""" Data """
data = {'NP': [['A', 'B'], ['C', 'D', 'E'], ['F']]}
data_p = {'NP': [3, 1, 1]}

""" Try it """
samples = []
for i in range(100000):
    samples.append(sample_from_dict('NP', data, data_p))

counts = [0, 0, 0]
for i in samples:
    if i == ['A', 'B']:
        counts[0] += 1
    elif i == ['C', 'D', 'E']:
        counts[1] += 1
    elif i == ['F']:
        counts[2] += 1

print(counts)

Output

[60130, 19867, 20003]
sascha