The packet will contain ID1, x1, y1; ID2, x2, y2; ID3…etc., with one ID, x, y triple for each person detected in the field of view. Each x, y pair gives that person's coordinates in pixels within a 640x480 array.
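
As an illustration of how a receiver might unpack such a packet, here is a minimal sketch. It rests on several assumptions that the description above does not confirm: that the packet arrives as text, that triples are separated by semicolons and fields by commas, that IDs and coordinates are integers, and that valid pixels run from 0 to 639 in x and 0 to 479 in y. The names `Detection` and `parse_packet` are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple

# Assumed frame size, taken from the 640x480 array in the description.
WIDTH, HEIGHT = 640, 480


class Detection(NamedTuple):
    """One detected person: an ID plus pixel coordinates (hypothetical type)."""
    person_id: int
    x: int
    y: int


def parse_packet(packet: str) -> List[Detection]:
    """Parse 'ID1, x1, y1; ID2, x2, y2; ...' into a list of Detection tuples.

    Assumes ';' separates people and ',' separates fields; the real wire
    format may differ.
    """
    detections = []
    for record in packet.split(";"):
        record = record.strip()
        if not record:
            # Tolerate an empty packet or a trailing ';'.
            continue
        fields = [field.strip() for field in record.split(",")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {record!r}")
        person_id, x, y = (int(field) for field in fields)
        # Assumed zero-based pixel range; adjust if the source is one-based.
        if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
            raise ValueError(f"coordinates out of range: ({x}, {y})")
        detections.append(Detection(person_id, x, y))
    return detections


if __name__ == "__main__":
    for detection in parse_packet("1, 320, 240; 2, 10, 470"):
        print(detection.person_id, detection.x, detection.y)
```

Rejecting malformed or out-of-range triples early, rather than silently dropping them, is a design choice made for this sketch; a real receiver may prefer to skip bad records and keep the rest of the packet.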