Customize Engine Settings
In this guide, we'll walk you through the process of customizing your engine settings by configuring the nitro.json file.
- Navigate to the App Settings > Advanced > Open App Directory > ~/jan/engines folder.
- MacOS: cd ~/jan/engines
- Windows: C:/Users/<your_user_name>/jan/engines
- Linux: cd ~/jan/engines
- Modify the nitro.json file based on your needs. The default settings are shown below.
~/jan/engines/nitro.json
{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
The table below describes the parameters in the nitro.json file.
| Parameter | Type | Description |
|---|---|---|
| ctx_len | Integer | Typically set to 2048, ctx_len provides ample context for model operations, comparable to GPT-3.5. (Minimum: 1, Maximum: 4096) |
| ngl | Integer | Defaults to 100; determines how many model layers are offloaded to the GPU. |
| cpu_threads | Integer | Determines the number of CPU threads used for inference, limited by your hardware and OS. (Maximum determined by the system) |
| cont_batching | Boolean | Enables continuous batching, enhancing throughput for LLM inference. |
| embedding | Boolean | Enables embeddings for tasks like document-enhanced chat in RAG-based applications. |
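For instance, a machine with more CPU cores that needs longer prompts might use a configuration like the sketch below. The values ctx_len: 4096 and cpu_threads: 4 are illustrative assumptions, not recommended defaults; choose them to match your own hardware and model.

```json
{
  "ctx_len": 4096,
  "ngl": 100,
  "cpu_threads": 4,
  "cont_batching": false,
  "embedding": false
}
```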
tip
- By default, ngl is set to 100, which offloads all model layers to the GPU. If you wish to offload only about 50% of the layers, set ngl to 15, since most Mistral or Llama models have around 30 layers.
- To use the embedding feature, include the JSON parameter "embedding": true. This enables Nitro to process inferences with embedding capabilities. Please refer to Embedding in the Nitro documentation for a more detailed explanation.
- To use continuous batching, which boosts throughput and minimizes latency in large language model (LLM) inference, include "cont_batching": true. For details, please refer to Continuous Batching in the Nitro documentation. A combined example configuration is sketched after this list.
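Putting these tips together, a nitro.json that offloads roughly half of the layers and enables both embedding and continuous batching might look like the sketch below. The values simply reuse the figures from the tips above and the defaults shown earlier; they are illustrative, not prescriptive.

```json
{
  "ctx_len": 2048,
  "ngl": 15,
  "cpu_threads": 1,
  "cont_batching": true,
  "embedding": true
}
```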
Assistance and Support
If you have questions, please join our Discord community for support, updates, and discussions.