First of all, welcome to Stack Overflow!
I have never personally considered using Kinect for image recognition, but if possible, you should reduce the image to a reasonable size, such as 100x100, to keep it manageable.
You should also try converting the image to grayscale: it reduces both computation and development time, and it is much easier to start with than RGB.
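As a rough illustration of those two preprocessing steps, here is a minimal pure-Python sketch. It assumes a frame is a list of rows of (r, g, b) tuples; the function names `to_grayscale` and `resize_nearest`, the common luminance weights, and the 640x480 dummy frame are my own assumptions for the example, not anything Kinect-specific. In practice you would probably use an image library instead.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    # Assumed weights: the commonly used 0.299 / 0.587 / 0.114 luminance mix.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(gray_image, new_width, new_height):
    """Shrink (or stretch) an image by nearest-neighbour sampling, no smoothing."""
    old_height = len(gray_image)
    old_width = len(gray_image[0])
    return [
        [
            gray_image[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


# Hypothetical 640x480 single-colour frame standing in for a real camera frame.
frame = [[(200, 100, 50)] * 640 for _ in range(480)]
small = resize_nearest(to_grayscale(frame), 100, 100)
print(len(small), len(small[0]))  # 100 100
```

Nearest-neighbour sampling is the simplest choice and throws away detail; an averaging resize would likely give the network cleaner input.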
The input layer will not have just 1 neuron, as you specified. For a 100x100 image, the total number of inputs should be 10000, one for each pixel. Remember that you are trying to break the data down into the smallest possible pieces so that the ANN can detect patterns in it.
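To make the one-input-per-pixel idea concrete, here is a small sketch that flattens a 100x100 grayscale image into a single list of 10000 values. The name `to_input_vector` and the dummy mid-grey image are hypothetical, and scaling each pixel to the range 0 to 1 is my own assumption (a common habit, not something your setup requires).

```python
def to_input_vector(gray_image):
    """Flatten a 2-D grayscale image row by row and scale each pixel to [0, 1]."""
    # Dividing by 255 is an assumed normalisation step; drop it if you prefer raw values.
    return [pixel / 255.0 for row in gray_image for pixel in row]


# Hypothetical 100x100 grayscale image (all mid-grey), for illustration only.
image = [[128] * 100 for _ in range(100)]
inputs = to_input_vector(image)
print(len(inputs))  # 10000
```

Each element of `inputs` would then feed one neuron of the input layer.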
2 neurons . , , . 2 , (, ) (, ). , 2 , , , .
3 , , , . , , ! , .