
First of all, I've seen a similar thread; however, it's a bit different from what I'm trying to achieve. I am constructing a robot which will follow the person who calls it (3D sound localization). My idea is to use 3 or 4 microphones, e.g. in the following arrangement, in order to determine the direction from which the robot was called:

[diagram: source S and microphones A, B, C arranged in a triangle]

Where S is the source and A, B and C are microphones. The idea is to calculate the phase correlation of the signals recorded by the pairs AB, AC and BC, and based on that construct a vector that points at the source using a kind of triangulation. The system does not even have to work in real time, because it will be voice activated: signals from all the microphones will be recorded simultaneously, voice will be sampled from only one microphone, and if it fits the voice signature, the phase correlation will be computed from the last fraction of a second in order to compute the direction. A rough sketch of that pairwise step is below. I am aware that this might not work too well, e.g. when the robot is called from another room or when there are multiple reflections.
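To make the pairwise step concrete, here is a rough, untested sketch (the function name and buffer layout are just placeholders I made up): estimate the delay between two equally long microphone buffers by brute-force cross-correlation.

// Illustrative only: find the lag (in samples, within +/- max_lag) at which
// buffer b best matches buffer a. A positive result means the wavefront
// reached microphone a before microphone b.
#include <stddef.h>

int estimate_delay(const double *a, const double *b, size_t n, int max_lag)
{
    int best_lag = 0;
    double best = -1.0e300;
    for (int lag = -max_lag; lag <= max_lag; lag++) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            long j = (long)i + lag;          // sample of b aligned with a[i]
            if (j < 0 || j >= (long)n)
                continue;                    // skip samples outside the buffer
            sum += a[i] * b[(size_t)j];
        }
        if (sum > best) {
            best = sum;
            best_lag = lag;
        }
    }
    return best_lag;
}

With closely spaced microphones the true delay is a fraction of a sample, so the integer lag alone won't do; one would interpolate around the correlation peak, or work with phase differences directly.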

This is just an idea I had, but I have never attempted anything like this and I have several questions before I construct the actual hardware that will do the job:

  1. Is this a typical way of doing this? (i.e. used in phones for noise cancellation?) What are other possible approaches?
  2. Can phase correlation be calculated between 3 sources simultaneously somehow? (i.e. in order to speed up the computation)
  3. Is a 22 kHz sample rate and 12-bit depth sufficient for this system? I am especially concerned about the bit depth.
  4. Should the microphones be placed in separate tubes in order to improve separation?
lennon310
Max Walczak
    Here's an interesting article, maybe you've seen it. It looks like the author ended up putting a fourth mic above the other 3 in order to deal with the sound source being above the array. Other than that it looks pretty similar to your plan (to my untrained eye, at least). – Guest Feb 25 '18 at 20:42
  • The general term for the phase correlation part is Beamforming. A common beamforming system uses a linear array of microphones, and I'm not sure the field of "vision" for your microphones will really allow for much triangulation. – pscheidler Feb 26 '18 at 04:44
  • Regarding triangulation, I guess you could set up two or three of the arrays some distance apart and find the intersection of the beams. Could solve 2-beam degenerate case with "hey robot..." (robot turns to face you)... "come here!" – Guest Feb 26 '18 at 18:30
  • Actually, that could work by adding one more mic. Check this out, it's a variation of Harry's solution. The equilateral triangle becomes a right triangle, and one more mic is added to form another triangle. From each triangle we cast a beam, and take the average of those two beams to get an accurate direction vector. Notice the two "eyes" in the demo. They're placed so that the beams running through them will triangulate position when source is directly in front of or behind robot. Try it out with source at any y=0. – Guest Feb 27 '18 at 05:36
  • Calculating energy peaks at each microphone wouldn't be sufficient? (this is really not my area) – Filipe Pinto Jul 03 '19 at 11:54
  • @FilipePinto have you read the answers and the description of the problem thoroughly? It can't really work like that since you can't know how each energy peak from each microphone is correlated with other microphones - that's why you need phase correlation, iterative closest point or some other registration algorithm (registration doesn't refer to recording here, but to matching one signal against another) to match recorded waveforms and detect their mutual shift within some time window – Max Walczak Jul 03 '19 at 13:22
  • I was working on a similar project recently but still without success. I will follow your discussion. Just want to point out one mistake here: $\frac{\text{speed of sound}}{\text{sound frequency}}=\frac{343\text{ m/s}}{6\text{ kHz}}$ is not 5.71 mm, it is actually 57.1 mm, so be careful when building a mics platform :). TB – Tobajer Sep 25 '19 at 13:38
  • @pscheidler A linear array would not work, it can only fit the source to a hyperboloid locus, which is not very helpful. It has no ability to tell the difference between a sound coming from one side of the array vs the other side vs an angle from above. 3 noncollinear mics pins it down to a 2-point locus (one above the floor and one below), so that would be sufficient. – endolith Jan 27 '20 at 16:48
  • @FilipePinto You could use amplitude if the mics were perfectly omnidirectional, but phase information is going to be more accurate. You could combine both sources of information to get a better estimate, since they produce different locuses (hyperbolas for phase, and spheres for amplitude, I think) but it probably doesn't matter for this application – endolith Jan 27 '20 at 16:50
  • Also note that this is multilateration, not triangulation. – endolith Jan 27 '20 at 20:59
  • can someone make a device with multiple small microphones in a circular array and map the general location of incoming sound onto a screen (via an app)? An omnidirectional microphone is an option also. The computer monitor can display a blinking/flashing dot to help someone with a hearing impairment see where a sound is coming from. For example, a blinking dot at the bottom right corner of the monitor will tell if the sound is coming from behind/the right side of your computer desk, a dot at the top of the monitor for overhead (2 stories, open to workspaces below), a dot at the left will be sound from l – user51788 Aug 11 '20 at 01:16

2 Answers


To extend Müller's answer,

  4. Should the microphones be placed in separate tubes in order to improve separation?

No, you are trying to identify the direction of the source; adding tubes will only make the sound bounce around inside the tube, which is definitely not wanted.

The best course of action would be to make them face straight up. This way they will all receive similar sound, and the only thing that is unique about them is their physical placement, which will directly affect the phase. A 6 kHz sine wave has a wavelength of $\frac{\text{speed of sound}}{\text{sound frequency}}=\frac{343\text{ m/s}}{6\text{ kHz}}\approx 57.2\text{ mm}$. So if you want to uniquely identify the phases of sine waves up to 6 kHz, which covers the typical frequencies of human speech, then you should space the microphones at most 57.2 mm apart, and at most half of that, 28.6 mm, if you want to rule out spatial aliasing entirely. Here is one item with a diameter smaller than that. Don't forget to add a low-pass filter with a cut-off frequency at around 6-10 kHz.
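If it helps, here is the same arithmetic as a tiny throwaway program (the 343 m/s and 6 kHz figures are the ones assumed above):

#include <stdio.h>

int main(void)
{
    const double speed_of_sound = 343.0;   /* m/s */
    const double f_max = 6000.0;           /* highest frequency of interest, Hz */
    double wavelength = speed_of_sound / f_max;
    printf("wavelength:              %.1f mm\n", wavelength * 1000.0); /* ~57.2 */
    printf("half-wavelength spacing: %.1f mm\n", wavelength * 500.0);  /* ~28.6 */
    return 0;
}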

Edit

I felt that question #2 looked fun, so I decided to try to solve it on my own.

  2. Can phase correlation be calculated between 3 sources simultaneously somehow? (i.e. in order to speed up the computation)

If you know your linear algebra, then you can imagine that you have placed the microphones in a triangle where each microphone is 4 mm from the other two, making each interior angle $60°$.

So let's assume they are in this configuration:

       C
      / \
     /   \
    /     \
   /       \
  /         \
 A - - - - - B

I will...

  • use the nomenclature $\overline{AB}$ which is a vector pointing from $A$ to $B$
  • call $A$ my origin
  • write all numbers in mm
  • use 3D math but end up with a 2D direction
  • set the vertical position of the microphones to their actual waveform, i.e. each microphone's instantaneous output sample becomes its z-coordinate
  • calculate the cross product based on the microphones' positions and waveforms, then ignore the height information from this cross product and use arctan to come up with the actual direction of the source
  • call $a$ the output of the microphone at position $A$, call $b$ the output of the microphone at position $B$, call $c$ the output of the microphone at position $C$

So the following things are true:

  • $A=(0,0,a)$
  • $B=(4,0,b)$
  • $C=(2,2\sqrt{3},c)$, since $\sqrt{4^2-2^2}=2\sqrt{3}$

This gives us:

  • $\overline{AB} = (4,0,b-a)$
  • $\overline{AC} = (2,2\sqrt{3},c-a)$

And the cross product is simply $\overline{AB}×\overline{AC}$

$$ \begin{align} \overline{AB}×\overline{AC}&= \begin{pmatrix} 4\\ 0\\ b-a\\ \end{pmatrix} × \begin{pmatrix} 2\\ 2\sqrt{3}\\ c-a\\ \end{pmatrix}\\\\ &=\begin{pmatrix} 0\cdot(c-a)-(b-a)\cdot2\sqrt{3}\\ (b-a)\cdot2-4\cdot(c-a)\\ 4\cdot2\sqrt{3}-0\cdot2\\ \end{pmatrix}\\\\ &=\begin{pmatrix} 2\sqrt{3}(a-b)\\ 2a+2b-4c\\ 8\sqrt{3}\\ \end{pmatrix} \end{align} $$

The Z information, $8\sqrt{3}$, is just junk, of zero interest to us. As the input signals change, the cross product vector will swing back and forth along the source direction. So half of the time it will point straight at the source (ignoring reflections and other parasitics), and the other half of the time it will point 180 degrees away from the source.

What I'm talking about is $\arctan(\frac{2a+2b-4c}{2\sqrt{3}(a-b)})$, which can be simplified to $\arctan(\frac{a+b-2c}{\sqrt{3}(a-b)})$; then turn the radians into degrees.

So what you end up with is the following equation:

$$\arctan\Biggl(\frac{a+b-2c}{\sqrt{3}(a-b)}\Biggr)\frac{180}{\pi}$$
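If you want to convince yourself that the equation behaves, here is a small self-contained check (my own toy numbers: a 1 kHz plane wave arriving from 40°, microphones placed as above; the plane-wave model is the only assumption):

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double theta = 40.0 * M_PI / 180.0; /* true source direction        */
    const double f = 1000.0;                  /* test tone, Hz                */
    const double v = 343000.0;                /* speed of sound, mm/s         */
    const double k = 2.0 * M_PI * f / v;      /* phase advance per mm of path */
    /* phase advance at each mic = k * (mic position projected onto the
       direction towards the source) */
    double pa = k * (0.0 * cos(theta) + 0.0             * sin(theta));
    double pb = k * (4.0 * cos(theta) + 0.0             * sin(theta));
    double pc = k * (2.0 * cos(theta) + 2.0 * sqrt(3.0) * sin(theta));
    double wt = M_PI;                         /* an instant where cos(wt) < 0 */
    double a = sin(wt + pa), b = sin(wt + pb), c = sin(wt + pc);
    double est = atan2(a + b - 2.0 * c, sqrt(3.0) * (a - b)) * 180.0 / M_PI;
    printf("true 40.0 deg, estimated %.2f deg\n", est); /* ~40.0 */
    return 0;
}

Pick an instant where $\cos(\omega t)>0$ instead (say $\omega t=0.1$) and the same code prints roughly $-140°$, which is exactly the 180° ambiguity dealt with next.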


But half the time the information is literally 100% wrong, so how should one make it right all of the time?

Well if $a$ is leading $b$, then the source can't be closer to B.

In other words, just make something simple like this:

source_direction = atan2(a + b - 2*c, sqrt(3)*(a - b)) * 180/pi;
if(a > b){
   if(b > c){        // a > b > c
     possible_center_direction = 240; // A is closest, then B, last C
   }else if(a > c){  // a > c > b
     possible_center_direction = 180; // A is closest, then C, last B
   }else{            // c > a > b
     possible_center_direction = 120; // C is closest, then A, last B
   }
}else{
   if(c > b){        // c > b > a
     possible_center_direction = 60;  // C is closest, then B, last A
   }else if(a > c){  // b > a > c
     possible_center_direction = 300; // B is closest, then A, last C
   }else{            // b > c > a
     possible_center_direction = 0;   // B is closest, then C, last A
   }
}

// If the source is out of bounds (more than 60 degrees from the sector
// centre, measured with wrap-around), rotate it by 180 degrees.
// Note: % doesn't work on floating point in C, hence fmod.
angle_error = fmod(source_direction - possible_center_direction + 540.0, 360.0) - 180.0;
if(angle_error > 60.0 || angle_error < -60.0){
  source_direction = fmod(source_direction + 180.0, 360.0);
}

And perhaps you only want to react if the sound source is coming from a suitable vertical angle: people talking directly above the microphones => virtually no phase differences => do nothing; people talking horizontally next to it => some phase differences => react. You can gauge that with the magnitude of the horizontal component:

$$ \begin{align} |P| &= \sqrt{P_x^2+P_y^2}\\ &= \sqrt{3(a-b)^2+(a+b-2c)^2}\\ \end{align} $$

So you might want to set that threshold to something low, like 0.1 or 0.01. I'm not entirely sure; it depends on the volume, frequency and parasitics, so test it yourself.

Another reason to keep an eye on that magnitude is zero crossings: around them there may be brief moments when the direction points the wrong way. Though it will only be for 1% of the time, if even that. So you might want to attach a first-order low-pass filter to the direction.

true_true_direction = true_true_direction*0.9+source_direction*0.1;

And if you want to react only to a specific volume, then just sum the 3 microphones together and compare that to some trigger value. The mean value of the microphones would be their sum divided by 3, but you don't need to divide by 3 if you increase the trigger value by a factor of 3. A sketch that pulls these pieces together follows below.
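Pulling the magnitude gate, the low-pass filter and the volume trigger together, a per-sample update could look something like this (the function name and both thresholds are made-up placeholders, tune them on real recordings):

#include <math.h>

static double true_true_direction = 0.0;   /* persists between calls */

/* a, b, c: latest microphone samples. Returns 1 when the direction
   estimate was updated, 0 when the sample was ignored. */
int update_direction(double a, double b, double c)
{
    double x = sqrt(3.0) * (a - b);
    double y = a + b - 2.0 * c;
    double horizontal = sqrt(x * x + y * y);      /* |P| from above         */
    double volume = fabs(a) + fabs(b) + fabs(c);  /* crude loudness measure */

    if (horizontal < 0.01 || volume < 0.1)  /* from above, or too quiet */
        return 0;

    double source_direction = atan2(y, x) * 180.0 / M_PI;
    /* ...the 180-degree disambiguation from the earlier snippet goes here... */

    /* first-order low-pass; note this naive form misbehaves across the
       360/0 wrap-around */
    true_true_direction = true_true_direction * 0.9 + source_direction * 0.1;
    return 1;
}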


I'm having issues with marking the code as C/C#/C++ or JS or any other, so sadly the code will be black on white, against my wishes. Oh well, good luck on your venture. Sounds fun.

Also, there is a 50/50 chance that the direction will be 180° away from the source 99% of the time. I'm a master at making such mistakes. A correction for this, though, would be to just invert the if statements for when the 180 degrees should be added.

Harry Svensson
  • I wonder if the phase thing is really necessary, or if each mic can just look for some identifiable feature. If all mics hear "hey robot" then couldn't they line up the onset of that "bah" sound and ignore phase? Then you shouldn't need to place the mics so close together... – Guest Feb 25 '18 at 22:25
  • well, yes, but that would only work up to twice-the-sample-period precision, and I don't see how you're gaining anything with that – your estimation variance will go down because detecting the onset of a known sound isn't that different from comparing the recordings of two microphones, only that your "reference" sound isn't inherently as similar (read: well-correlating) as your other microphone's sounds (all assuming your SNR is above 3 dB, but that's a safe bet). – Marcus Müller Feb 25 '18 at 22:52
  • To Guest: Well, it depends on if you want to do it right/pedantic correctness, or if you want to do it "okay" and call it a day. I went for the pedantic way because it's more fun. But sure, the "okay" way works as well, space them apart and hope for the best. Might work great, might work horribly. I won't recommend "okay" solutions. @MarcusMüller I can't tell if that's a comment to me or Guest. – Harry Svensson Feb 25 '18 at 23:21
  • 1
    @HarrySvensson, I see what you mean. I was thinking you could use something like your approach, except $a$, $b$ and $c$ would be a number of milliseconds since the first mic heard the sound. I played around with it here, but it's not lining up perfectly when the source, a mic, and the center of the robot aren't all in a line. I think it might be "okay" though, check it out. Error's not as bad when source is far from mics. I'm sure it could be corrected, but the math escapes me. – Guest Feb 26 '18 at 06:59
  • @Guest A "number of milliseconds" spacing would be $(343\text{ m/s})×(0.001\text{ s})=34.3\text{ cm}$ apart. I used 1 ms because 2 ms or above resulted in ridiculous distances. - Well, to tell you the truth, a similar question arose on EE.SE, which I solved, in that question the goal was to locate a sound source with two microphones, hand-held object. And I continued further and actually made a proper simulation. I recommend you to click the link and click on the switch and the transmission line. Double click to edit. Add 1 ms and see what happens. It's not ok – Harry Svensson Feb 26 '18 at 07:14
  • @Guest The RC filters on the right under the "Hello" are the output of interest. 2.5 V => phase leads, less than 2.5 V and it shows how much it lags according to this formula $V_{out}=\frac{lag}{180°}×2.5\text{ V}$. The more spacing you use, the worse it gets. Though my simulation is not 100% fair to you because I am using noise 24/7, rather than short bursts of noise which would simulate vocal noise. But... the correlation will be "forgotten", there's no proper memory in my solution. It's instantaneous. - I won't mind if you come up with a solution that works as you describe. – Harry Svensson Feb 26 '18 at 07:18
  • @Guest Also, is that Desmos thing a demo of my math? Or is it a demo of your math? It looks really cool none the less. It looks like you're cheating a little bit, using distance ;), a DC value rather than a sine wave. – Harry Svensson Feb 26 '18 at 07:28
  • @HarrySvensson, it's your math, vector towards source is (roughly) cross-product of vectors AB and AC, except it was coming out backwards so I flipped it around, $(C_y\left(a-b\right), C_x(b-a)-B_x(c-a))$. Not sure why it points exactly towards source at 6 different angles, with increasing error between them. I guess this has to do with sinusoidal nature of your $abc$ values vs mine. Point taken about ms; I wonder what kind of timing resolution this robot will have. Will need some sleep before your simulation sinks in. :) – Guest Feb 26 '18 at 08:29
  • I don't think the distances are cheating, I subtract the minimum distance from all 3, so the first mic that heard the sound is at 0. Was thinking you could divide them by speed of sound to get the actual times, but that shouldn't affect outcome since we just care about direction and not magnitude (not sure how position could be calculated rather than direction... maybe the robot just keeps going until it bumps into your foot). – Guest Feb 26 '18 at 08:37
  • @HarrySvensson that comment was for Guest! Thank you for your awesome answer + edit – Marcus Müller Feb 26 '18 at 08:49
  • Not sure I've ever seen code highlighting working here on SE.DSP. Let me check with the Teacher's Lounge and see what they say. Looks like someone asked on Meta some time ago, but no action was taken: https://dsp.meta.stackexchange.com/questions/133/is-code-syntax-highlighting-turned-on-for-signal-processing – Peter K. Feb 26 '18 at 14:04
  • Please go and upvote that post on Meta.DSP. I've added the tag feature-request which should at least see some engagement, but we need the votes. If the Chemistry.SE site has it enabled, we should definitely! :-) https://dsp.meta.stackexchange.com/questions/133/is-code-syntax-highlighting-turned-on-for-signal-processing – Peter K. Feb 26 '18 at 14:21
  • "If only one tube hears the sound due to no reflections around the robot to bounce into either of the other two tubes." That's not how the microphones work. You could do it in an anechoic chamber with no reflections at all, and all 3 mics would still always pick up sound, tube or no tube. The pressure wave passing the opening is what the mic is picking up. It has nothing to do with reflections. – endolith Jan 27 '20 at 16:44
  • what are the a, b and c values in your code? – Fatih Özyürek Jan 27 '20 at 14:02
  • @FatihÖzyürek it would be the value straight out of the microphone. I'm no pro when it comes to the details but I assume that it is a signed byte, which should be fed straight into the code above. However if I would get the chance to redo this answer then I would not go with the answer I've given. I would go with computing the FFT of all 3 microphones, do some cross correlation to acquire the phase difference -> time difference... The first FFT element is the 0 Hz value, I would only need to look on the phase difference between the 3 microphones second FFT element. Then set a,b,c = phase diff. – Harry Svensson Jan 28 '20 at 03:39
  • @endolith You are correct, I have now removed that part. Thank you. – Harry Svensson Jan 28 '20 at 03:39
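To make the FFT-based variant from Harry's comment above a bit more concrete, here is a minimal single-bin DFT, a hypothetical sketch rather than anything from the answer, that extracts the phase each microphone sees at one analysis frequency:

#include <math.h>

/* Phase, in radians, of `signal` at DFT bin `k` over an n-sample window. */
double bin_phase(const double *signal, int n, int k)
{
    double re = 0.0, im = 0.0;
    for (int i = 0; i < n; i++) {
        double w = 2.0 * M_PI * k * i / n;
        re += signal[i] * cos(w);   /* real part of the bin */
        im -= signal[i] * sin(w);   /* imaginary part       */
    }
    return atan2(im, re);
}

/* Usage idea: with three simultaneously recorded n-sample windows, take
   bin_phase(mic_a, n, 1), bin_phase(mic_b, n, 1), bin_phase(mic_c, n, 1),
   wrap the pairwise differences into (-pi, pi], and feed those phase
   differences into the direction formula in place of raw samples. */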
  1. Yes, this feels reasonable and typical.
  2. You can just as well use the three microphone signals at once (not going the "detour" through your three pair correlations). Look for "MUSIC" and "ESPRIT" in direction-of-arrival applications.
  3. Very likely it is. You're not aiming for high audio quality, you're aiming for good cross-correlation properties, and a few bits here and there will probably not make or break the system. A higher sampling rate like the very common 44.1 kHz or 48 kHz, on the other hand, would very likely instantly double the angular precision at the same observation length (see the back-of-the-envelope check below).
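As a quick back-of-the-envelope check of point 3 (my own arithmetic, not part of the answer): one sample period of delay corresponds to this much extra acoustic path.

#include <stdio.h>

int main(void)
{
    const double v = 343.0;  /* speed of sound, m/s */
    const double rates[] = { 22050.0, 44100.0, 48000.0 };
    for (int i = 0; i < 3; i++)
        printf("%7.0f Hz: %.1f mm of path per sample\n",
               rates[i], 1000.0 * v / rates[i]);
    /* 22050 Hz -> 15.6 mm, 44100 Hz -> 7.8 mm, 48000 Hz -> 7.1 mm:
       doubling the sample rate halves the path-difference quantum,
       roughly doubling angular precision for the same window. */
    return 0;
}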
Marcus Müller