GitHub - filiparag/ftn-tcp-send: File transfer architecture based on dual-stack TCP

Task

The goal of this project was to realize a fast file transfer system based on a dual-stack TCP architecture, enabling communication via both IPv4 and IPv6 protocols. The project was inspired by well-known download managers (such as GetRight, FlashGet, and GoZilla), and the task was to implement a client and server that exchange file segments using the TCP protocol.

The server should support multiple clients simultaneously, while the client transfers file parts through one or more connections and reconstructs the original file after reception. Additionally, the project includes the application of the Playfair algorithm for packet content encryption and decryption: the server encrypts the file content during transmission, while the client performs the corresponding decryption. The password for forming the Playfair matrix is determined by the server administrator.

Solution Concept

The designed solution uses a binary protocol for transferring files in chunks, where both client and server can have multiple worker threads in parallel, with each thread establishing a separate long-lived TCP connection via either IPv4 or IPv6 protocol. The server dictates the chunk size, transfer rate, and encryption, which is applied to the file content during transmission. The client chooses between serial and parallel downloading based on the file size and number of worker threads.

Solution Description

The following text presents the interface protocol as well as a description of the provided reference implementation.

Protocol

A simple binary protocol called TCPsend was created for client-server cooperation during file downloads. It follows the request-response concept, where a positive response can be a single message or a stream of messages (packets). The messages that exist in this protocol are described below, and their logical content is presented in Tables 1.

Request
`str` name
`u64` start
`u64` size

Found
`str` name
`u64` size
`u64` start
`[u8]` data

Not Found
`str` name

Tables 1: Logical representation of messages

Request is a packet containing the file name, starting byte, and size of the requested chunk. If the start and size are both zero, it's a metadata request; otherwise, it's a data request.

Found is a packet containing the name and size of the entire file, the starting byte of the chunk, and the chunk data. The chunk data returned in the response is always less than or equal to the requested size, but the server has the freedom to respond with a stream of such packets until it has sent the requested number of bytes in total.

If the client requested metadata, the starting byte field equals zero, and the data field is empty.

Not found is a packet containing only the file name, indicating that it doesn't exist on the server.
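The three logical messages map naturally onto a Rust enum. The following is a minimal sketch and not the reference implementation's actual types: the names `Message`, `is_metadata_request`, and `is_metadata_reply` are hypothetical and only illustrate the rules stated above.

```rust
/// Logical TCPsend messages, mirroring Tables 1 (type and field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    Request { name: String, start: u64, size: u64 },
    Found { name: String, size: u64, start: u64, data: Vec<u8> },
    NotFound { name: String },
}

impl Message {
    /// A request whose start and size are both zero asks for metadata only.
    pub fn is_metadata_request(&self) -> bool {
        matches!(self, Message::Request { start: 0, size: 0, .. })
    }

    /// A Found reply to a metadata request has start == 0 and an empty data field.
    pub fn is_metadata_reply(&self) -> bool {
        matches!(self, Message::Found { start: 0, data, .. } if data.is_empty())
    }
}
```

Any request with a non-zero size is treated as a data request, even when it starts at byte zero.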

Request

Length (bytes)	Field
`6`	first magic
`16`	name length
`[..]`	name
`6`	second magic
`1`	variant
`8`	start
`8`	chunk size

Found

Length (bytes)	Field
`6`	first magic
`16`	name length
`[..]`	name
`6`	second magic
`1`	variant
`8`	file size
`8`	start
`8`	data size
`[..]`	data

Not Found

Length (bytes)	Field
`6`	first magic
`16`	name length
`[..]`	name
`6`	second magic
`1`	variant

Tables 2: Representation of message packets in memory

Tables 2 describe the representation of these message packets in memory during transport over the TCP connection. All multi-byte numeric types are little-endian, and names are UTF-8 encoded without a terminating character. The algorithm for reading incoming packets in the reference implementation is presented in the additional Diagram 7.

The first magic value 0xAABBCCDDEEFF serves to locate the beginning of the next packet within the incoming data stream, as there's a possibility of arbitrary data between valid packets, including incomplete packets. The second magic value 0x122334455667 is a delimiter for the name field, which verifies that the preceding length and name value were correctly interpreted.

The [..] notation in the table indicates variable-length fields, and numeric values express length in bytes. Since the beginning of all packets is identical, the variant field indicates the type of message: 0x01 is a request, 0x02 is a positive response, and 0xFF indicates that the chunk (or file) was not found.
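As an illustration of the layout, a Request packet could be serialized and located in a byte stream roughly as follows. This is a sketch under stated assumptions, not the reference implementation: the function names are hypothetical, the magic values are assumed to be written byte by byte in the order they are printed above, and the 16-byte name length is assumed to be a little-endian `u128`.

```rust
/// Magic values from the text, written byte by byte in the order printed
/// (an assumption; the text does not say how the magics are serialized).
const FIRST_MAGIC: [u8; 6] = [0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF];
const SECOND_MAGIC: [u8; 6] = [0x12, 0x23, 0x34, 0x45, 0x56, 0x67];
const VARIANT_REQUEST: u8 = 0x01;

/// Serializes a Request packet following the Request layout in Tables 2.
pub fn encode_request(name: &str, start: u64, chunk_size: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(6 + 16 + name.len() + 6 + 1 + 8 + 8);
    buf.extend_from_slice(&FIRST_MAGIC);
    // Assumption: the 16-byte length field is a little-endian u128.
    buf.extend_from_slice(&(name.len() as u128).to_le_bytes());
    buf.extend_from_slice(name.as_bytes()); // UTF-8, no terminator
    buf.extend_from_slice(&SECOND_MAGIC);
    buf.push(VARIANT_REQUEST);
    buf.extend_from_slice(&start.to_le_bytes());
    buf.extend_from_slice(&chunk_size.to_le_bytes());
    buf
}

/// Returns the offset of the first occurrence of the first magic, if any,
/// skipping arbitrary bytes that may precede a valid packet.
pub fn find_packet_start(stream: &[u8]) -> Option<usize> {
    stream
        .windows(FIRST_MAGIC.len())
        .position(|w| w == &FIRST_MAGIC[..])
}
```

A reader would call `find_packet_start` on the buffered input, parse from that offset, and reject the candidate if the second magic does not follow the name.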

Implementation

The provided implementation is written in the Rust programming language, and can run on all operating systems supported by the compiler without external dependencies. Program configuration is done through environment variables at startup.

Server

At startup, the server binds to the specified socket, populates the file list, starts a pool of worker threads, and then waits for and responds to incoming requests. For a graphical representation of the state flow during processing of a single request, see the additional Diagram 6.

Each time the server receives a new request, it adds the request to the queue and notifies the first available worker thread. After accepting a request, the worker thread checks whether the file exists in the in-memory list, which makes the response faster and doesn't burden the repository (the directory with available files). The list is not updated during the server's lifetime, as the repository content is assumed to be immutable.
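The queue-and-workers arrangement can be sketched with the standard library alone. The code below is an illustration, not the server's actual code: `serve_all` is a hypothetical function, a "request" is reduced to a file name, and the "response" is only whether that name is in the in-memory list.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs `jobs` on `workers` threads that pull from one shared queue.
/// Returns (name, found) pairs sorted by name.
pub fn serve_all(workers: usize, files: Vec<String>, jobs: Vec<String>) -> Vec<(String, bool)> {
    let files = Arc::new(files); // immutable list built once at startup
    let (job_tx, job_rx) = mpsc::channel::<String>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel::<(String, bool)>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            let files = Arc::clone(&files);
            thread::spawn(move || loop {
                // The lock is held only while taking the next job off the queue.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(name) => {
                        let found = files.contains(&name);
                        res_tx.send((name, found)).unwrap();
                    }
                    Err(_) => break, // queue closed, no more requests
                }
            })
        })
        .collect();

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // closing the queue lets the workers exit
    drop(res_tx); // only the workers' clones keep the result channel open
    for handle in handles {
        handle.join().unwrap();
    }
    let mut results: Vec<_> = res_rx.iter().collect();
    results.sort();
    results
}
```

Sharing the list behind an `Arc` without a lock is safe here only because the list is never modified after startup.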

If it's a metadata request, it responds immediately based on information from the list. Otherwise, it opens the file, seeks to the specified chunk start $P$, and reads the content. The amount of content it reads depends on client and server settings, so if the client requests $V_K$ bytes per chunk and the server is allowed to send $V_S$ bytes per chunk, the smaller of these two values is always chosen: $V = \min(V_K, V_S)$.

When $V_K \leq V_S$, the server sends one chunk of size $V_K$, and when $V_K > V_S$ it sends a stream of chunks until it reaches the end of the file $D$ or the requested size: $\sum_i V_i = \min(D - P, V_K)$.
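The two formulas above can be checked with a small planning function. This is a sketch with a hypothetical name (`plan_chunks`); it only computes which packets would be sent and performs no file or network I/O.

```rust
/// Splits one data request into the packets the server would stream back.
/// d = file size D, p = requested start P, v_k = bytes requested by the client,
/// v_s = the server's per-packet limit. Returns (start, size) pairs.
pub fn plan_chunks(d: u64, p: u64, v_k: u64, v_s: u64) -> Vec<(u64, u64)> {
    let total = v_k.min(d.saturating_sub(p)); // min(D - P, V_K)
    let per_packet = v_k.min(v_s); // V = min(V_K, V_S)
    let mut chunks = Vec::new();
    let mut sent = 0;
    while sent < total && per_packet > 0 {
        // The final packet may be shorter than V.
        let size = per_packet.min(total - sent);
        chunks.push((p + sent, size));
        sent += size;
    }
    chunks
}
```

For example, with $D = 10000$, $P = 0$, $V_K = 3000$, and $V_S = 1024$, the plan consists of three packets of 1024, 1024, and 952 bytes.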

After the response is sent, the file handle is closed, since the server architecture doesn't maintain session state between requests. The TCP connection, however, remains open until the client closes it or it expires due to inactivity. This way, the client isn't obligated to use one connection for only one file, but can request metadata and data for arbitrary parts of multiple files through the same connection.

If encryption is enabled, the content is transparently encrypted during file reading, and all other steps remain unchanged. Files are expected to contain only letters of the English alphabet in this case, otherwise undefined behavior occurs in the encrypted representation.

After each response is sent, whether part of a stream or not, the server applies bandwidth limiting by waiting for a certain time period. The waiting doesn't adversely affect server performance, since requests on other connections can be serviced during it.
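The text does not say how the waiting period is computed. One plausible rule, shown here purely as an assumption, is to wait for the time the packet's payload would occupy on a link of the configured rate; `throttle_delay` is a hypothetical helper, and treating a rate of zero as "unlimited" is likewise an assumption.

```rust
use std::time::Duration;

/// Time to wait after sending `bytes` so that the average rate stays at
/// `rate_bits_per_sec`. A rate of 0 is treated here as "unlimited".
pub fn throttle_delay(bytes: u64, rate_bits_per_sec: u64) -> Duration {
    if rate_bits_per_sec == 0 {
        return Duration::ZERO;
    }
    // Payload size in bits divided by the allowed bits per second.
    Duration::from_secs_f64((bytes * 8) as f64 / rate_bits_per_sec as f64)
}
```

A worker thread would pass the result to `std::thread::sleep` after each packet. Only that connection's thread sleeps, so other connections continue to be served.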

Client

At startup, the client receives the names of the files to download as command-line positional arguments. Although the protocol allows interleaved downloading, the reference implementation downloads files one at a time for simplicity. The following text describes the procedure performed for each file individually.

Based on the allowed number of worker threads $W$, the client attempts to establish $W$ TCP connections to the server in total: $C_4$ over IPv4 and $C_6$ over IPv6. Depending on whether IPv4, IPv6, or both server addresses are specified, connections are distributed as evenly as possible, with IPv6 taking priority. When both address types are available, the number of connections is determined as follows:

$$C_6 = \begin{cases} 1 & \text{if } W \leq 2 \\ \lceil W / 2 \rceil & \text{otherwise} \end{cases}$$

$$C_4 = W - C_6$$

When only one address type is available, all connections are of that type. Connections are distributed so that IPv6 comes first, followed by IPv4.
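The distribution rule translates directly into code. The function name `split_connections` and its boolean parameters are hypothetical; the case $W = 0$ is outside the formula and is handled here only so that the subtraction cannot underflow.

```rust
/// Returns (C_6, C_4) for `w` worker threads, given which address types
/// of the server are available.
pub fn split_connections(w: u32, has_ipv6: bool, has_ipv4: bool) -> (u32, u32) {
    match (has_ipv6, has_ipv4) {
        (true, true) => {
            // C_6 = 1 if W <= 2, otherwise ceil(W / 2); capped at W so W = 0 is safe.
            let c6 = if w <= 2 { w.min(1) } else { (w + 1) / 2 };
            (c6, w - c6) // C_4 = W - C_6
        }
        (true, false) => (w, 0),
        (false, true) => (0, w),
        (false, false) => (0, 0),
    }
}
```

With both address types available, $W = 3$ gives two IPv6 connections and one IPv4 connection, while $W = 8$ gives four of each.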

After establishing the connections, the client requests file metadata from the server via the first connection. If the file doesn't exist, the procedure stops (Diagram 1). Based on the file size $D$ obtained from the metadata and the number of connections, the client decides whether to download serially (one thread) or in parallel (multiple threads) using the heuristic $D \lessgtr V_{\text{min}} \cdot W$, with $V_{\text{min}} = 32 \text{ bytes}$.
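A minimal sketch of this decision is shown below. The names `Mode` and `choose_mode` are hypothetical, and the text does not specify which side the boundary case $D = V_{\text{min}} \cdot W$ falls on; the sketch assumes it is handled serially.

```rust
/// V_min in bytes, the value chosen in the text.
const V_MIN: u64 = 32;

#[derive(Debug, PartialEq)]
pub enum Mode {
    Serial,
    Parallel,
}

/// Chooses the download mode for a file of `d` bytes and `w` worker threads.
pub fn choose_mode(d: u64, w: u64) -> Mode {
    // Assumption: D == V_MIN * W is treated as too small to split.
    if w > 1 && d > V_MIN.saturating_mul(w) {
        Mode::Parallel
    } else {
        Mode::Serial
    }
}
```

With four worker threads the threshold is 128 bytes, so a 100-byte file is downloaded serially and a 129-byte file in parallel.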

Diagram 1: Attempt to download a non-existent file

Serial downloading is a special case of the parallel downloading described below. The main difference visible to the user is that during serial downloading, the file is not split into multiple parts that are joined into one at the end. Diagram 2 presents the serial download of one file in two chunks, and the entire download procedure is described in the additional Diagram 4 and Diagram 5.

At the beginning of parallel downloading, the client starts $W$ worker threads and passes one TCP connection to each. Each creates its own part in the working directory and fills it with zeros, so that each has the same size $D_W = \lceil D/W \rceil$, except the last one which may be smaller due to the division remainder.

Then, thread $n$ requests $\lceil D_W / V_K \rceil$ file chunks from the server, starting from the beginning of its file part $P_n = D_W \cdot n$. The client then writes each received chunk $i$ of size $V_i$ at the specified location $P_i$ in the file part. Information about the write size and location is taken from the response header, since the server isn't obligated to respond to a request with exactly as much data as the client requests, and may also split the response into multiple packets (Diagram 3).
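The part boundaries and request counts follow from integer arithmetic. The helpers below are a sketch with hypothetical names (`part_ranges`, `requests_per_part`), not the client's actual code.

```rust
/// Returns (P_n, length) for each of the `w` file parts of a `d`-byte file.
pub fn part_ranges(d: u64, w: u64) -> Vec<(u64, u64)> {
    assert!(w > 0, "at least one worker is required");
    let d_w = (d + w - 1) / w; // D_W = ceil(D / W)
    (0..w)
        .map(|n| {
            let start = (d_w * n).min(d); // P_n = D_W * n, clamped to the file end
            let len = d_w.min(d - start); // the last part may be shorter
            (start, len)
        })
        .collect()
}

/// Number of requests needed for a part of `len` bytes when asking for
/// `v_k` bytes per request; `v_k` must be greater than zero.
pub fn requests_per_part(len: u64, v_k: u64) -> u64 {
    (len + v_k - 1) / v_k // ceil(len / V_K)
}
```

For a 10-byte file and three threads, the parts are 4, 4, and 2 bytes long, starting at offsets 0, 4, and 8.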

Diagram 2: Complete serial file download procedure

When a thread receives all $D_W$ bytes of its file part, it closes the connection with the server and returns the file part handle to the main thread. The main thread opens an empty target file in the working directory and sequentially writes the content of each file part into it when the worker thread finishes downloading. After copying the content, it deletes that file part from the working directory and moves to the next one, until all parts are copied.
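The joining step can be sketched with `std::fs` and `std::io::copy`. This simplified version assumes all parts are already complete and takes their paths rather than handles returned by worker threads; `merge_parts` is a hypothetical name.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Appends each part to `target` in order and deletes the part afterwards.
/// Returns the total number of bytes written.
pub fn merge_parts(parts: &[PathBuf], target: &Path) -> io::Result<u64> {
    let mut out = File::create(target)?; // creates (or truncates) the target file
    let mut total = 0;
    for part in parts {
        let mut src = File::open(part)?;
        total += io::copy(&mut src, &mut out)?;
        drop(src); // close the part before removing it
        fs::remove_file(part)?; // the part is no longer needed once copied
    }
    Ok(total)
}
```

Because parts are copied strictly in order, the target file is correct as long as the slice lists the parts by ascending offset.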

As with the server, if encryption is enabled, the content is transparently decrypted during writing to file parts. The user provides the key at program startup and it must match the one on the server. If the key is not provided or is incorrect, the file content will remain encrypted or will be incorrect.

Diagram 3: Server response as a packet stream

Testing

To verify the performance and correctness of the reference implementation, its behavior was tested against multiple criteria. For each criterion, the methodology is described and the experimental results are presented below.

Content Authenticity

The basic expectation of any content sharing server and download program is that data within exchanged files remains unchanged. Verification was performed by sending and downloading multiple types of files (text, image, audio, compressed archives) of different sizes (from a few bytes to several gigabytes), and then the SHA-2 checksum of the source and obtained file was verified. In all cases, the checksums matched, regardless of whether the download was serial or parallel.

Playfair Encryption

Since this encryption only works correctly on text files, the phrase thequickbrownfoxjumpsoverthelazydog was written to a plain text file without a newline at the end for testing. The password rtrk was used.

When encryption is enabled on the server but not on the client, the content of the downloaded file is tgnvombuqslzgblqmohtpksuinapgzyenzu. When the correct password is entered, the content matches the original, and for an incorrect password netacno, the content becomes qkkyhvtqkhpgcpwvhlqsuztdmalizynucw.

The procedure was repeated on a larger text file (64 KiB) with the same outcome.

Download Speed

To measure file download speed depending on program settings, a script was written that automates running experiments and collecting results.

Measurements were performed on a computer with a Linux 6.16 kernel, AMD Ryzen 9 7900 processor with 12 cores (with two threads each) and maximum clock speed of 5.68 GHz, and G.SKILL 32GB Trident Z5 Neo DDR5 RAM at 6000 MHz and CAS latency of 30 cycles.

On a virtual disk in RAM, a repository directory was created with binary files of random content in the sizes shown in Figure 1. The client downloaded each file with between 1 and 10 worker threads and the default chunk size of 1024 bytes. For each combination of file size and number of worker threads, the total client execution time for downloading that single file was measured under the different server bandwidth limits shown in Figure 2. Each download was repeated 3 times and the median value was taken to reduce the impact of process scheduling. A total of 1830 different measurements were performed; parameter combinations that wouldn't occur together in practice were discarded (e.g., downloading 1 GiB over a connection limited to 56 Kib/s).

Serial and Parallel Downloading

Figure 1 clearly shows the expected downward trend in download time as the number of worker threads increases, even when bandwidth is unlimited. It also confirms that establishing multiple connections isn't useful for small files: the chunks transferred over each connection are so small that the splitting and joining procedure takes far longer than sending the data over the network.

Figure 1: File download time depending on the number of worker threads

Bandwidth Limitation

For a representative sample of execution time comparison relative to bandwidth, a 64 KiB file was chosen, as it's small enough for slower limitations but large enough compared to the default chunk size.

In Figure 2, the same drop in execution time is observed, and the values for serial downloading are very close to the theoretical transfer times: 9140 ms for 56 Kib/s, 4000 ms for 128 Kib/s, 500 ms for 1 Mib/s, and so on. This indicates that the reference implementation of bandwidth limiting is correct.
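The quoted theoretical values follow from dividing the file size in bits by the bandwidth limit, assuming binary prefixes throughout (1 Kib = 1024 bits), which reproduces the figures above after rounding:

$$t_{56} = \frac{64 \cdot 1024 \cdot 8 \text{ bit}}{56 \cdot 1024 \text{ bit/s}} = \frac{512}{56} \text{ s} \approx 9.14 \text{ s}$$

$$t_{128} = \frac{64 \cdot 1024 \cdot 8 \text{ bit}}{128 \cdot 1024 \text{ bit/s}} = 4 \text{ s}, \qquad t_{1024} = \frac{64 \cdot 1024 \cdot 8 \text{ bit}}{1024 \cdot 1024 \text{ bit/s}} = 0.5 \text{ s}$$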

Figure 2: Download time for 64 KiB depending on connection bandwidth

Comparison with Other Solutions

FileZilla release 3.69 was chosen as a representative example of existing download programs. The download speed of a 1 GiB file via FTP protocol was compared.

Compared to the setup in previous experiments, the number of TCPsend client worker threads was set to exactly 8, and both server and client were allowed to send chunks up to 1 MiB in size, to make the comparison fairer.

It was measured that the FileZilla client takes 892 ms, while the TCPsend client takes 759 ms, thus showing that the presented implementation has similar performance to freely available existing solutions.

Conclusion

Based on the presented experiments, it can be concluded that the described protocol and reference implementation are satisfactory for basic file transfer needs over the network. Although the results show performance of the same order of magnitude as commercial solutions, the encryption algorithm is insecure and applies only to data, not metadata, so demonstrative academic programs of this nature are not recommended for anything other than educational purposes.

Appendix

Below are diagrams with details of state changes and algorithms of the reference implementation:

Diagram 4: File download procedure by the client

Diagram 5: Chunk download procedure by the client

Diagram 6: Request processing procedure by the server

Diagram 7: Finding the next packet in the incoming stream procedure
