Python: A RAII-like File Interface

Date: 31.10.2024

The last article treated the PCAP file format and showed how to generate and read PCAP files in a fairly simple and portable way using the "ctypes" module, which is part of the Python standard library. The file interaction in that article was designed in a very direct way, which usually works for simple scripts but often becomes a problem once software grows more complex. In this article, we will resolve this problem and reimplement the file handling in a scalable way. The demonstration takes up the example of PCAP files again, but the general principle is applicable across a much wider range of file types.
RAII ("resource acquisition is initialization") is a widely used approach that encapsulates a resource in a class and makes it available to the rest of the program through that class's interface. Typically, the constructor acquires the resource and the destructor frees it. This approach is rooted in languages like C++, where you know exactly that the destructor will be called once an object representing a resource is destroyed intentionally, goes out of scope or is unwound because of an exception. As a consequence, you have guarantees on the safety of a program dealing with the respective resource.
The manual memory management of C++ is of course far more deterministic than the garbage collector-based one of the Python interpreter. CPython combines reference counting with a cyclic garbage collector: an object is usually finalized as soon as its reference count hits zero, but objects caught in reference cycles have to wait for the collector, other Python implementations may finalize objects at an arbitrary later time, and the language does not guarantee that destructors of objects still alive at interpreter exit are called at all. As a consequence, you cannot rely on the destructor of a Python object being executed right after the object leaves scope, or even before other modules (on whose presence you might count in the destructor implementation) are torn down during program termination, which can lead to ugly errors.
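The delayed finalization described above can be observed with a small experiment. The following is only a sketch under the assumption that it runs on CPython; the class "Tracked" and the list "log" are hypothetical names made up for this illustration. Two objects reference each other, so their reference counts never reach zero on their own and the destructor has to wait for the cyclic collector.

```python
import gc

log = []  # hypothetical record of finalized objects


class Tracked:
    """Hypothetical stand-in for a resource; records when it is finalized."""

    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        log.append(self.name)


gc.disable()  # keep the cyclic collector from running on its own

a = Tracked("a")
b = Tracked("b")
a.partner, b.partner = b, a  # reference cycle: the counts cannot drop to zero
del a, b
finalized_after_del = list(log)  # snapshot taken right after the names are gone

gc.collect()  # only now does the collector find and finalize the pair
finalized_after_collect = sorted(log)

gc.enable()
```

Without the cycle, CPython would normally run "__del__" immediately at the "del" statement, which is exactly why code that relies on this behaviour tends to work until it suddenly does not.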
This has two consequences. First, you have to be especially careful with destructors in Python and with what you do in them. The second consequence is even more unfortunate: what you can have in Python is not really RAII. Nevertheless, with some precautions you can have something that resembles RAII (is "RAII-like") and reap many benefits of this approach. I often use an approach like this in my software designs, not only in Python but also in other (garbage collector-based) languages, and it usually pays off. Now let's construct a file interface for PCAP files based on what we learned in the last article.
We can define the class and the constructor as in Listing 1.
import ctypes


class PcapFile:
    class PcapFileHeader(ctypes.Structure):
        # Layout of the global PCAP file header
        _pack_ = 1
        _fields_ = [("magic_nr", ctypes.c_uint32),
                    ("ver_maj", ctypes.c_uint16),
                    ("ver_min", ctypes.c_uint16),
                    ("thiszone", ctypes.c_uint32),
                    ("sigfigs", ctypes.c_uint32),
                    ("snaplen", ctypes.c_uint32),
                    ("network", ctypes.c_uint32)]

    class PcapFrameHeader(ctypes.Structure):
        # Layout of the header preceding every PCAP frame
        _pack_ = 1
        _fields_ = [("ts_s", ctypes.c_uint32),
                    ("ts_us", ctypes.c_uint32),
                    ("frm_size", ctypes.c_uint32),
                    ("frm_size_wire", ctypes.c_uint32)]

    def __init__(self, fname: str):
        self.target = open(fname, "ab+")
        self.target.seek(0, 0)
        header = self.target.read(ctypes.sizeof(self.PcapFileHeader))
        if len(header) == 0:
            # New file, write the header
            header = self.PcapFileHeader(magic_nr=0xa1b2c3d4,
                                         ver_maj=2,
                                         ver_min=4,
                                         thiszone=0,
                                         sigfigs=0,
                                         snaplen=65535,
                                         network=1)
            self.target.write(bytes(header))
        elif len(header) < ctypes.sizeof(self.PcapFileHeader):
            # Cannot be a PCAP file - too short!
            self.target.close()  # do not leak the open file on failure
            raise Exception("Not a PCAP file!")
        else:
            # File is long enough, check the magic number
            header = self.PcapFileHeader.from_buffer_copy(header)
            if header.magic_nr != 0xa1b2c3d4:
                self.target.close()  # do not leak the open file on failure
                raise Exception("Magic number wrong. Not a PCAP file!")
        # Position of the first frame header, used by "read"
        self._lastpos = ctypes.sizeof(self.PcapFileHeader)
In Listing 1, just like in the previous article, we introduce the classes "PcapFileHeader" and "PcapFrameHeader", which describe the layout of the PCAP global file header and the individual PCAP frame headers respectively. As they are defined inside the class as attributes, their visibility is reduced (primarily) to the class in which they are actually used. Note that defining classes inside classes is a good example of the dynamic nature of Python and that the classes "PcapFileHeader" and "PcapFrameHeader" are in fact mutable objects. Furthermore, be aware that these class objects are shared between all instances of "PcapFile" and that you must be very careful with them; for details on the problems that careless accesses to these objects can cause, consult the article on classes with mutable attributes. With those two header definitions in place, we can move on to defining the constructor.
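To make the role of such a nested header class more tangible, here is a minimal sketch. The outer class "Container" and all variable names are hypothetical and only serve the illustration; the frame header layout is the one from the article. It shows that the nested class is one shared object reachable through the outer class and its instances, and that a header survives the round trip through its raw bytes.

```python
import ctypes


class Container:
    """Hypothetical outer class, standing in for a file interface."""

    class FrameHeader(ctypes.Structure):
        _pack_ = 1
        _fields_ = [("ts_s", ctypes.c_uint32),
                    ("ts_us", ctypes.c_uint32),
                    ("frm_size", ctypes.c_uint32),
                    ("frm_size_wire", ctypes.c_uint32)]


first = Container()
second = Container()

# Both instances see the very same class object
shared = first.FrameHeader is second.FrameHeader is Container.FrameHeader

# Serialize a header to bytes and parse it back
header = Container.FrameHeader(ts_s=0, ts_us=0, frm_size=60, frm_size_wire=60)
raw = bytes(header)
restored = Container.FrameHeader.from_buffer_copy(raw)
header_size = ctypes.sizeof(Container.FrameHeader)
```

Because the class object is shared, anything assigned to it (for example a modified "_fields_" list before first use) would affect every instance, which is the reason for the caution expressed above.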
The first thing the constructor does is open the file whose name is passed in its only argument "fname". The mode we will use is "ab+", because
- we want to open an existing file (hence not "w", which would truncate it) and we intend to create a new file of the name contained in "fname" if such a file does not exist (hence not "r"),
- the opened file shall be readable and writeable (therefore "+") and
- we want to read and write binary data (as opposed to text, hence "b").
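The effect of this mode can be checked in isolation. The following sketch assumes a scratch file in a temporary directory (the path and variable names are made up for the illustration): the first "open" creates the file, the second one finds the previously written bytes untouched and can read them back.

```python
import os
import tempfile

# Hypothetical scratch file that does not exist yet
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
existed_before = os.path.exists(path)

with open(path, "ab+") as f:  # creates the file because it is missing
    f.write(b"abc")

with open(path, "ab+") as f:  # re-opening does not truncate the content
    f.seek(0, 0)              # go to the start in order to read everything
    content = f.read()
```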
After the constructor has been executed, the stream position of the file object is located at the byte after the PCAP header, which will be our starting point for reading frames from the file (if there are any). We create an attribute "_lastpos" containing this position for reasons I will explain in the next section, where the corresponding method will be developed.
In an interface for PCAP files (or similar file types) it makes sense to offer only a method for appending new data and to omit the possibility of writing at arbitrary positions. Consider the interplay of PCAP headers and PCAP data presented in the last article and suppose you tried to "exchange" an existing frame for a new one. If the new frame is shorter than the old one, the already existing header of the following PCAP frame is no longer located right after the new frame's end. Consequently, software reading the manipulated file would assume that it has found the next header when it is actually looking at leftover data, and would quickly misbehave or crash. The same is true when the new frame is longer than the old one, as the next header would be partly or completely overwritten, with the effect that the integrity of the PCAP file is again destroyed.
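The damage caused by such an in-place exchange can be reproduced on a byte string without touching any file. This is a sketch under assumptions: it uses the "struct" module instead of "ctypes", fixes the byte order to little-endian, leaves out the global file header, and all function and variable names are hypothetical.

```python
import struct

# ts_s, ts_us, frm_size, frm_size_wire (little-endian assumed for the sketch)
FRAME_HEADER = struct.Struct("<IIII")


def pack_frame(payload: bytes) -> bytes:
    """Build a frame header with zeroed timestamps followed by the payload."""
    return FRAME_HEADER.pack(0, 0, len(payload), len(payload)) + payload


def parse_frames(blob: bytes) -> list:
    """Walk through the blob header by header and collect the payloads."""
    frames = []
    pos = 0
    while pos + FRAME_HEADER.size <= len(blob):
        _, _, size, _ = FRAME_HEADER.unpack_from(blob, pos)
        pos += FRAME_HEADER.size
        frames.append(blob[pos:pos + size])
        pos += size
    return frames


original = pack_frame(b"A" * 8) + pack_frame(b"B" * 8)

# "Exchange" the first frame in place for a shorter one
shorter = pack_frame(b"C" * 4)
tampered = shorter + original[len(shorter):]

intact_frames = parse_frames(original)
tampered_frames = parse_frames(tampered)
```

In the tampered blob the parser interprets the tail of the old payload as the start of the next header, so the second frame is no longer recovered even though its bytes are still present.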
The methods for reading frames from a PCAP file and for appending them are implemented in Listing 2, which is meant as an addition to the code from Listing 1.
    def append(self, data: bytes) -> None:
        frame_header =\
            self.PcapFrameHeader(ts_s=0,
                                 ts_us=0,
                                 frm_size=len(data),
                                 frm_size_wire=len(data))
        self.target.write(bytes(frame_header) + data)

    def read(self) -> bytes:
        ret = b""
        self.target.seek(self._lastpos, 0)
        frame_header = self.target.read(ctypes.sizeof(self.PcapFrameHeader))
        if len(frame_header) != 0:
            frame_struct = self.PcapFrameHeader.from_buffer_copy(frame_header)
            frame_data = self.target.read(frame_struct.frm_size)
            ret = frame_data
        self._lastpos = self.target.tell()
        return ret

    def rewind(self) -> None:
        self._lastpos = ctypes.sizeof(self.PcapFileHeader)
The "append" method in Listing 2 is built on the fact that, at the time of writing, the Python interpreter uses "open" together with the flag O_APPEND on Linux and "_wopen" together with _O_APPEND on Windows behind the scenes for opening files with the mode "ab+". These two functions together with the respective flags ensure that the operating system moves the stream pointer to the end of the file before every write operation (see the documentation of these functions). With a file opened in mode "ab+" on these platforms, every write operation is therefore actually an append operation.
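This behaviour is easy to verify. The sketch below assumes one of the platforms named above and a scratch file in a temporary directory (path and variable names are made up): even though the stream position is moved to the start of the file before writing, the written byte ends up behind the existing content.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "append_demo.bin")  # hypothetical file

with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "ab+") as f:
    f.seek(0, 0)    # move the stream position to the very beginning
    f.write(b"X")   # the operating system still appends this byte
    f.flush()
    f.seek(0, 0)
    content = f.read()
```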
In "append", we act like in the previous article: we generate a new PCAP frame header according to the predetermined layout, append the data portion and write (i.e. append) everything to the file, which has already been opened in binary mode. Note that, for reasons of simplicity, we leave out the timestamp and set it to zero when writing the frame to the PCAP file. This can of course be changed easily (e.g. with a parameter of the type "float", which is subsequently split into its second and microsecond components and added to the frame header).
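One possible way of splitting such a timestamp is sketched below. The function name "split_timestamp" is hypothetical, and the sketch assumes a non-negative timestamp in seconds as returned by "time.time". Converting to whole microseconds first and only then splitting avoids a microsecond component of 1000000 that naive rounding of the fractional part could produce.

```python
import time


def split_timestamp(timestamp: float) -> tuple:
    """Split seconds (float) into whole seconds and microseconds."""
    total_us = round(timestamp * 1_000_000)
    return divmod(total_us, 1_000_000)


# The two values could then be passed on as "ts_s" and "ts_us"
ts_s, ts_us = split_timestamp(time.time())
```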
In the "read" method we want to ensure that the read and write positions of the file stay separate, so that we can retrieve the PCAP frame data one after the other even if we append frames between read operations. This is what the "_lastpos" attribute is for, which we initially set in the constructor to the byte after the PCAP file header (which is where the first byte of the first PCAP frame header would be).
The method starts with the variable "ret", which contains the value the method will eventually return to the caller, set to a default return value b"". In case we do not successfully read out any data from the file (most probably because we hit the end of the file) this will never be replaced and will consequently be what the caller receives. The empty bytes object is unambiguous in showing problems during readout, because the presence of an empty header without attached data does not make sense in the PCAP context; this header would never be written and thus we will never return b"" while regularly iterating through the file. Should we successfully read out data from the file, the variable "ret" is set to contain this data in the if-block.
Next, we take the last saved read position from "self._lastpos" and move the file stream position there, so that we start reading the next data block at that very position. At the end of the "read" method you can see that "self._lastpos" is updated with the position reached after reading a whole PCAP frame, which means that it again points to the first byte of a PCAP frame header, this time that of the following frame. In case there was nothing to read, "self._lastpos" is unchanged; this is convenient, because it already points to the first byte of the header of a possible future PCAP frame that has not yet been appended to the file and that can be read once it has been added. In this way we ensure that we can always iterate through the file consistently.
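The bookkeeping of the read position can be tried out independently of the PCAP format. The class "ChunkLog" below is a hypothetical, stripped-down stand-in that stores raw chunks without any headers and therefore needs the chunk size as an argument; it is a sketch of the principle, not a replacement for the "read" method of Listing 2.

```python
import os
import tempfile


class ChunkLog:
    """Hypothetical append-only file with a separately saved read position."""

    def __init__(self, path: str):
        self.target = open(path, "ab+")
        self._lastpos = 0

    def append(self, data: bytes) -> None:
        self.target.write(data)  # always lands at the end of the file

    def read(self, size: int) -> bytes:
        self.target.seek(self._lastpos, 0)   # restore the saved read position
        data = self.target.read(size)
        self._lastpos = self.target.tell()   # remember where to continue
        return data

    def close(self) -> None:
        self.target.close()


log_file = ChunkLog(os.path.join(tempfile.mkdtemp(), "chunks.bin"))
results = []
log_file.append(b"first")
results.append(log_file.read(5))
log_file.append(b"second")          # appended between two read operations
results.append(log_file.read(6))
results.append(log_file.read(6))    # nothing left to read at the moment
log_file.append(b"third")
results.append(log_file.read(5))    # reading resumes where it stopped
log_file.close()
```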
Having set the read position correctly, we first try to read the PCAP frame header. If this succeeds (i.e. the received frame header is not an empty bytes object), we interpret the frame header and extract the length of the PCAP data section (which has been explained in detail in the last article). The data block retrieved subsequently is written into the variable "ret" and is returned as the result of the method call. Should the readout fail, b"" is the return value, as already explained above.
"rewind" is the last method defined in Listing 2 and simply resets the read position to the first byte of the header of the first PCAP frame. This is the same position that was set by the constructor when the "PcapFile" object was initialized, which simply means that the "read" method will return the data section of the first PCAP frame in its next call and then iterate through the file again from the beginning.
Now that methods for acquiring the resource, reading from it and writing to it have been introduced, only the logic for freeing the resource (in our case the PCAP file) is missing in order to have a proper RAII interface. At this point, we are back at the problems outlined in the section about the RAII concept: as explained there, this task is usually taken over by the destructor of the object representing the resource, but we cannot rely on the Python interpreter calling the destructor in time, which in turn can lead to severe problems in the general case. What you can usually do to alleviate this is to define a dedicated method for closing the resource (fittingly called "close" in our example) and call it manually in the places where you know it needs to be called. That way you lose some of the convenience of RAII that you might have in non-managed languages like C++, but you can still have a RAII-like interface incorporating many of the RAII benefits.
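    def close(self) -> None:
        # "target" does not exist if "open" already failed in the constructor
        target = getattr(self, "target", None)
        if target is not None and not target.closed:
            target.close()

    def __del__(self):
        # Destructor: safety net in case "close" was never called
        self.close()

    def __iter__(self):
        return self

    def __next__(self):
        next_bytes = self.read()
        if next_bytes == b"":
            raise StopIteration
        else:
            return next_bytes
# End of class PcapFile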
Listing 3 is again meant as a continuation of Listing 2 and defines additional methods for the "PcapFile" class. "close" is the method that has already been mentioned above: it first checks whether the file abstracted by the "PcapFile" class has already been closed and closes it if this is not the case. Python file objects tolerate repeated calls to their own "close" method, so the check is not strictly required, but it makes explicit that "close" may safely run more than once. This matters because "close" is also called in the destructor (in case someone forgot to call "close"), which in turn may run as late as program termination.
The method "__iter__" is there to comply with the iterator protocol by returning the object itself as an iterator. Finally, the "__next__" method converts the protocol we defined for the "read" method (remember that b"" means we have reached the end of the file or encountered a more severe problem) into something that a caller expecting the iterator protocol will understand: such a caller needs a "StopIteration" exception to be raised in order to understand that the iterator is exhausted. The "__next__" method takes care of this once it reads b"" from the file, while passing through all other data unchanged.
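The translation from a sentinel return value to the iterator protocol can be shown on a small scale. The class "ChunkReader" is a hypothetical example that reads fixed-size chunks from an in-memory stream. For comparison, the sketch also uses the two-argument form of the built-in "iter", which performs the same translation for any callable and sentinel value.

```python
import io


class ChunkReader:
    """Hypothetical reader returning b"" once its source is exhausted."""

    def __init__(self, source, size: int):
        self.source = source
        self.size = size

    def read(self) -> bytes:
        return self.source.read(self.size)

    def __iter__(self):
        return self

    def __next__(self) -> bytes:
        data = self.read()
        if data == b"":
            raise StopIteration  # tell the caller that we are done
        return data


# Iterating via "__iter__" and "__next__"
chunks = list(ChunkReader(io.BytesIO(b"aabbcc"), 2))

# The same result with the built-in two-argument "iter"
reader = ChunkReader(io.BytesIO(b"aabbcc"), 2)
chunks_via_iter = list(iter(reader.read, b""))
```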
The fact that the "PcapFile" class supports the iterator protocol makes file handling a good deal more elegant, as you will see in the following section. With the end of the definition of "__next__", the "PcapFile" class we constructed is complete; we will now continue with the setup of a small application using the prepared infrastructure.
In the last section, we want to use the class "PcapFile" in an application, so note that while Listing 4 is again a continuation of Listing 3, this part does not belong to the class "PcapFile" any more.
target = PcapFile("some_frames.pcap")

# Write
for i in range(0, 5):
    target.append(i.to_bytes(1, "big") * 60)
target.close()


# Read
def handler(frame: bytes) -> None:
    print(frame)


target = PcapFile("some_frames.pcap")
for frame in target:
    handler(frame)
target.close()
The application in Listing 4 is modeled after the process developed in the last article and basically does the same thing. The first action is to open (and, in this case, create) a PCAP file "some_frames.pcap" in accordance with the RAII principle by instantiating the class "PcapFile". Then we append five frames using the object's "append" method in exactly the way that was outlined in the last article. Having written the frames, we close the file manually to demonstrate that re-opening the file works, which we do in the following step. Just like in the last article, we then define a "handler" function that decides what happens with the data read from our generated file, read each frame from the file and pass it to the handler, which prints it to the screen. Finally, we manually close the file again.
Using a RAII-inspired (and object-oriented) approach leads to a more scalable and tidier program structure than we had in the last article, where we developed the program in a more ad-hoc fashion and emphasized the relationship between the code and the structure of PCAP files. The most notable point where we departed from the RAII approach is the aforementioned manual closing of the file "some_frames.pcap", which would be left to the (automatic) execution of the object's destructor in a true RAII implementation. In contrast, we close the file manually when we know that the file object is no longer needed, because we want to ensure that the garbage collector does not fail us with respect to closing it (which could lead to data loss).
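A manual "close" call is skipped if an exception is raised before it is reached. One way of hardening the manual approach is sketched below with a hypothetical class "Resource" that stands in for any class offering a "close" method; the function "use" is also made up and merely simulates a failure. The sketch shows a "try"/"finally" block and, as an alternative, "contextlib.closing" from the standard library, which calls "close" when the "with" block is left.

```python
from contextlib import closing


class Resource:
    """Hypothetical stand-in for a class with a manual close() method."""

    def __init__(self):
        self.closed = False

    def close(self) -> None:
        self.closed = True


def use(resource: Resource) -> None:
    """Simulate a failure while the resource is in use."""
    raise RuntimeError("simulated failure")


# Variant 1: "finally" runs regardless of the exception
first = Resource()
try:
    use(first)
except RuntimeError:
    pass
finally:
    first.close()

# Variant 2: "closing" calls close() when the block is left
second = Resource()
try:
    with closing(second):
        use(second)
except RuntimeError:
    pass
```

Both variants keep the explicit "close" method as the single place where the resource is released, so they fit the RAII-like design without relying on the destructor.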
It has to be noted that the usage of the iterator protocol together with the chosen object-oriented and RAII-inspired approach plays its own important role in the simplification of the code in Listing 4: Obviously no more than two lines are needed to implement the PCAP parser in the application now, while everything else (including the test for hitting the end of the file) happens behind the scenes.
You have seen that the algorithms for writing data to and reading data from a PCAP file outlined in the last article can also be implemented in a very different way. The focus of the last article was more on the algorithms themselves and on how to interact with a given PCAP file in a very direct way, while the present article takes these algorithms and implements them in a way that is scalable, clean and (most importantly) re-usable. The class "PcapFile" is generic enough to work in a library setting, and it could be a building block of several different scripts or even software projects that import it as a module.
Notably, the showcased RAII-like approach helped with the construction of such a re-usable class and even though in the end (as mentioned) you will not get true RAII in Python due to the way the garbage collector works, you can still reap many of its benefits if you are careful with the way you use the destructor and if you are ready to manually close your files when this is needed.