NumPy: 3-byte, 6-byte types (aka uint24, uint48)

NumPy lacks built-in support for 3-byte and 6-byte types, aka uint24 and uint48. I have a large dataset using these types and want to load it into NumPy. What I am currently doing (for uint24):

import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
#  dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
#  dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2byte data to 2 x 3byte
# w1 is LSB, w3 is MSB
w1, w2, w3 = a['data'].swapaxes(0,1)
a2 = np.ndarray((2,a.size), dtype='u4')
# 3 LSB
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# 3 MSB
a2[1] = w3
a2[1] <<=8
a2[1] += w2 >> 8
# now a2 contains "uint24" matrix

While this works for a 100 MB input, it looks inefficient (think of 100 GB of data). Is there a more efficient way? For example, a special read-only view that masks part of the data would be useful (a kind of "uint64 whose two most significant bytes are always zero" type). I only need read-only access to the data.

3 answers

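I don't believe there is a direct way to do this (it would require unaligned access, which is inefficient on some architectures). One option is to copy the data into an in-process array: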

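a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
e = np.zeros(a.size // 6, np.dtype('>u8'))  # integer division: the shape must be an int
for i in range(3):
    # copy the i-th 16-bit word of every 6-byte value into its 8-byte slot
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]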
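
To check the copy approach without a file, here is a minimal sketch on an in-memory buffer. The sample values and the names `raw`, `src` and `dst` are made up for illustration; the data is assumed to be big-endian, as in the snippet above.

```python
import numpy as np

# Three made-up 48-bit values, stored as big-endian 6-byte integers.
values = [1, 0x010203040506, 0xFFFFFFFFFFFF]
raw = b"".join(v.to_bytes(6, "big") for v in values)

a = np.frombuffer(raw, dtype=np.uint8)
e = np.zeros(a.size // 6, dtype=">u8")

# View both arrays as big-endian 16-bit words: each 6-byte value has 3 words,
# each 8-byte slot has 4. Word 0 of every slot stays zero (the unused high bytes).
src = a.view(">u2")
dst = e.view(">u2")
for i in range(3):
    dst[i + 1::4] = src[i::3]

decoded = e.tolist()
print(decoded == values)
```

The copy needs 8 bytes of memory per 6 bytes of input, so for very large files it would presumably have to be done in chunks.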

Alternatively, you can get unaligned access through the strides constructor parameter:

buf = a  # the raw bytes; assumed here to be the uint8 memmap from above
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), buf, strides=(6,))

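However, the elements of this array overlap, so the two high bytes of each one have to be masked off on access.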
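
A minimal sketch of the strided view on an in-memory buffer, assuming little-endian data. Consecutive 8-byte elements sit only 6 bytes apart, so each one also picks up the first two bytes of its neighbour, and a mask removes them. The sample values and the two padding bytes are assumptions made for the demonstration.

```python
import numpy as np

values = [1, 0x060504030201, 0xFFFFFFFFFFFF]
# Two padding bytes so that the last 8-byte read stays inside the buffer.
# Without them, (size - 2) // 6 would drop the final value.
raw = b"".join(v.to_bytes(6, "little") for v in values) + b"\x00\x00"

buf = np.frombuffer(raw, dtype=np.uint8)
n = (buf.size - 2) // 6
view = np.ndarray((n,), dtype="<u8", buffer=buf, strides=(6,))

# Keep only the low 48 bits of every overlapping element.
decoded = (view & np.uint64(0xFFFFFFFFFFFF)).tolist()
print(decoded == values)
```

The view itself copies nothing; only the masking step produces a new array, and it can be applied to a slice rather than to the whole view.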

There is an answer to this in a related question: how do I create a NumPy dtype that includes 24-bit integers?

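It is a bit ugly, but it does what you want: it lets you index the ndarray as if it had a dtype of <u3, so you can memmap() large data from disk.
You still have to apply a bitmask manually to cut off the overlapping fourth byte, but this can be done on the sliced (multidimensional) array after access.

The trick is to abuse the strides of the ndarray so that indexing works. NumPy checks that the strided view stays inside the buffer, so the last element needs special handling.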
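
A minimal sketch of this idea for little-endian 24-bit data. The helper name `overlapping_u24_view` is hypothetical, and the single padding byte is an assumption that keeps the last 4-byte read inside the buffer.

```python
import numpy as np

MASK24 = np.uint32(0xFFFFFF)

def overlapping_u24_view(buf):
    """Hypothetical helper: view a uint8 array as overlapping little-endian
    uint32 elements spaced 3 bytes apart. No data is copied."""
    n = (buf.size - 1) // 3  # every 4-byte read must stay inside buf
    return np.ndarray((n,), dtype="<u4", buffer=buf, strides=(3,))

values = [0x000001, 0x030201, 0xFFFFFF, 0xABCDEF]
raw = b"".join(v.to_bytes(3, "little") for v in values) + b"\x00"  # 1 pad byte
view = overlapping_u24_view(np.frombuffer(raw, dtype=np.uint8))

# Slice first, mask afterwards: only the selected elements are read and masked.
subset = (view[1:3] & MASK24).tolist()
everything = (view & MASK24).tolist()
print(subset, everything == values)
```

Masking after slicing means that, with a memmap as the buffer, only the pages behind the selected elements should need to be read.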

Using the code below, you can read integers of any byte size, stored in either big-endian or little-endian order:

def readBigEndian(filename, bytesize):
    # yield successive big-endian unsigned integers of `bytesize` bytes
    with open(filename, "rb") as f:
        chunk = f.read(bytesize)
        while len(chunk) == bytesize:
            value = 0
            for byte in chunk:  # iterating over bytes yields ints in Python 3
                value = (value << 8) | byte
            yield value
            chunk = f.read(bytesize)

def readLittleEndian(filename, bytesize):
    # yield successive little-endian unsigned integers of `bytesize` bytes
    with open(filename, "rb") as f:
        chunk = f.read(bytesize)
        while len(chunk) == bytesize:
            value = 0
            shift = 0
            for byte in chunk:
                value |= byte << shift
                shift += 8
            yield value
            chunk = f.read(bytesize)

for i in readLittleEndian("readint.py", 3):
    print(i)
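
In Python 3 the per-byte loop can also be replaced by the built-in `int.from_bytes`. A minimal sketch, using the hypothetical name `read_ints` and an in-memory stream so that it runs without a file:

```python
import io

def read_ints(stream, bytesize, byteorder="little"):
    """Yield unsigned integers of `bytesize` bytes from a binary stream."""
    while True:
        chunk = stream.read(bytesize)
        if len(chunk) < bytesize:  # end of data; a trailing partial chunk is dropped
            return
        yield int.from_bytes(chunk, byteorder)

data = io.BytesIO(bytes([1, 0, 0, 0x01, 0x02, 0x03]))
little = list(read_ints(data, 3))
data.seek(0)
big = list(read_ints(data, 3, "big"))
print(little, big)
```

This still handles one value per Python-level iteration, so it suits small files or spot checks better than bulk loading.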